View the data in the user table in HBase:
hbase(main):003:0> scan 'user'
The following is sample output; the actual output contains more than five hundred rows.
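8 Deploying the Mahout Data Mining Tool
8.1 Deploying Mahout
Open the main page of the Xiandian big data platform, click the Actions button on the left, and add the Mahout service.
8.2 Running the Examples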
# slaver1
# su - mahout
$ mahout
8.2.1 Implementing a Naive Bayes Classifier
# tar -zxvf
# mkdir 20news
# mv 20news-bydate-test 20news
# mv 20news-bydate-train 20news
# cd 20news
# hadoop fs -mkdir /data/mahout/20news
# hadoop fs -mkdir /data/mahout/20news/20news-all
# hadoop fs -put * /data/mahout/20news/20news-all
Convert the uploaded files into Hadoop sequence files with the following command:
# mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq
Use the hadoop fs -text command to check the sequence file output:
# hadoop fs -text /data/mahout/20news/output/20news-seq/part-m-00000 | more
The sample output looks similar to the following:
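Once the sequence files have been created, no analysis has yet been performed on the words or the text. The Bayes algorithm cannot work directly on words and raw text, but it can work on weight vectors associated with the original documents. The raw text therefore needs to be converted into weight and frequency vectors, using the following command:
# mahout seq2sparse -i /data/mahout/20news/output/20news-seq -o /data/mahout/20news/output/20news-vectors -lnorm -nv -wt tfidf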
# hadoop fs -ls /data/mahout/20news/output/20news-vectors
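As a rough illustration of what a TF-IDF weight vector is, the following Python sketch computes textbook TF-IDF weights for three made-up documents. The function name tfidf_vectors, the sample documents, and the formula used (relative term frequency multiplied by log(N/df)) are assumptions chosen for illustration. Mahout's seq2sparse may apply a different weighting and normalization, so these numbers are not expected to match its output.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using textbook TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample documents, not taken from the 20news data set.
docs = [
    "the engine of the car",
    "the hockey game tonight",
    "repair the car engine",
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(sorted(vector.items()))
```

A term that occurs in every document (here "the") gets weight zero, while a term that occurs in only one document (such as "hockey") gets the largest weight. This is why such vectors are more useful to a classifier than raw word counts.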
Split the TF-IDF vectors into a training set and a test set:
# mahout split -i /data/mahout/20news/output/20news-vectors/tfidf-vectors --trainingOutput /data/mahout/20news/output/20news-train-vectors --testOutput /data/mahout/20news/output/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
Train the naive Bayes classifier on the training vectors with the following command:
# mahout trainnb -i /data/mahout/20news/output/20news-train-vectors -el -o /data/mahout/20news/output/model -li /data/mahout/20news/output/labelindex -ow
Test the classifier on the test vectors:
# mahout testnb -i /data/mahout/20news/output/20news-test-vectors -m /data/mahout/20news/output/model -l /data/mahout/20news/output/labelindex -ow -o /data/mahout/20news/output/20news-testing
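To show the idea behind the training and testing steps, here is a minimal sketch of a textbook multinomial naive Bayes classifier with Laplace smoothing. It is not Mahout's implementation: the functions train_nb and classify, the smoothing parameter alpha, and the four tiny training documents are all hypothetical, and the sketch works on raw word counts rather than on TF-IDF vectors.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs, alpha=1.0):
    """Fit a multinomial naive Bayes model with Laplace smoothing (alpha)."""
    label_counts = Counter()            # documents per label
    term_counts = defaultdict(Counter)  # term occurrences per label
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "terms": term_counts,
            "vocab": vocab, "alpha": alpha}


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    n_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        counts = model["terms"][label]
        total = sum(counts.values())
        score = math.log(doc_count / n_docs)  # log prior
        for token in text.lower().split():
            if token in model["vocab"]:  # skip terms never seen in training
                score += math.log((counts[token] + alpha)
                                  / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training documents labeled with two newsgroup names.
training = [
    ("rec.autos", "car engine oil brakes"),
    ("rec.autos", "engine repair car dealer"),
    ("rec.sport.hockey", "hockey game goal team"),
    ("rec.sport.hockey", "team playoff hockey season"),
]
model = train_nb(training)
print(classify(model, "new car engine"))
print(classify(model, "hockey team won the game"))
```

In the Mahout workflow above, the model written by trainnb plays the role of the counts collected here, and testnb applies the same kind of scoring to the held-out test vectors.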
8.2.2 Item-Based Collaborative Filtering
# hadoop fs -mkdir /data/mahout/project-collaborative
# hadoop fs -put /data/mahout/project-collaborative
# mahout recommenditembased -i /data/mahout/project-collaborative/ -o /data/mahout/project-collaborative/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 7 --minPrefsPerUser 2 --maxPrefsInItemSimilarity 7 --tempDir /data/mahout/project-collaborative/temp
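Options:
--input                     Path to the preference data, a text file in the format userid\titemid\tpreference
--output                    Path for the recommendation results
--numRecommendations (-n)   Number of recommendations
--usersFile                 Users to generate recommendations for; by default, recommendations are generated for all users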
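The following Python sketch illustrates the idea of item-based collaborative filtering with a Euclidean-distance similarity on a tiny, made-up preference list in the tab-separated userid, itemid, preference format. Everything in it is an assumption for illustration: the functions parse_prefs, euclidean_similarity, and recommend, the similarity formula 1 / (1 + distance), and the sample ratings. Mahout's distributed job may compute similarities and predictions differently.

```python
import math
from collections import defaultdict


def parse_prefs(lines):
    """Parse 'userid<TAB>itemid<TAB>preference' lines into {user: {item: pref}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item, value = line.split("\t")
        prefs[user][item] = float(value)
    return prefs


def euclidean_similarity(prefs, a, b):
    """Similarity 1 / (1 + distance) over the users who rated both items."""
    diffs = [(r[a] - r[b]) ** 2 for r in prefs.values() if a in r and b in r]
    if not diffs:
        return 0.0
    return 1.0 / (1.0 + math.sqrt(sum(diffs)))


def recommend(prefs, user, n=3):
    """Rank unrated items by a similarity-weighted average of the user's ratings."""
    rated = prefs[user]
    all_items = {item for ratings in prefs.values() for item in ratings}
    scores = []
    for candidate in all_items - set(rated):
        num = den = 0.0
        for item, value in rated.items():
            sim = euclidean_similarity(prefs, candidate, item)
            num += sim * value
            den += sim
        if den > 0:
            scores.append((candidate, num / den))
    # Highest predicted preference first; item id breaks ties.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:n]


# Hypothetical preference data: userid, itemid, preference.
sample = [
    "u1\ti1\t5", "u1\ti2\t3",
    "u2\ti1\t4", "u2\ti2\t2", "u2\ti3\t5",
    "u3\ti1\t1", "u3\ti3\t2",
]
prefs = parse_prefs(sample)
print(recommend(prefs, "u1", n=3))
```

Here user u1 has not rated item i3, so it is the only candidate. Its predicted preference is a weighted average of u1's ratings for i1 and i2, with weights given by how similar each of those items is to i3.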