A Distributed Data Stream Clustering Algorithm and Its Implementation on Storm
Wan Xingui (万新贵); Li Lingjuan (李玲娟); Ma Ke (马可)
【Journal】Computer Technology and Development (《计算机技术与发展》) 【Year (Volume), Issue】2017 (27), 7
【Abstract】To improve the efficiency of data stream clustering, this paper proposes CDD-Stream, a data stream clustering algorithm based on centroid distance and density grids. Applying a parallelization strategy to the grid-structure update in CDD-Stream yields a distributed data stream clustering algorithm, DCD-Stream (Distributed Centroid Distance D-Stream). The algorithm has an online part and an offline part. The online part receives the data stream in real time and uses local nodes and global nodes to update the grid structure in parallel, completing an incremental update of the overall grid structure. The offline part performs global clustering on the updated grid structure and stores grid frames so that users can query historical clusters. To exploit Storm's fast real-time stream processing, which can significantly improve the performance of data stream mining algorithms, a Storm-based implementation scheme for DCD-Stream is designed and implemented. The scheme uses the in-memory database Redis and the message middleware Kafka to deploy and implement the DCD-Stream topology. Comparative experiments show that, relative to other algorithms, DCD-Stream achieves high clustering accuracy and better timeliness on data streams, and that the Storm-based implementation scheme is feasible and effective. 【Pages】6 (pp. 150-155)
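The online/offline split described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: the abstract gives no formulas, so the cell width `CELL`, the density threshold `DENSE_MIN`, the centroid-distance threshold `MAX_CENTROID_DIST`, and all function names are hypothetical, and density decay and grid frames are left out. Plain Python stands in for the Storm topology; no Storm, Kafka, or Redis calls are involved.

```python
import math
from itertools import product

# All three constants are illustrative assumptions, not values from the paper.
CELL = 1.0                # grid cell width
DENSE_MIN = 3             # minimum point count for a cell to count as dense
MAX_CENTROID_DIST = 1.5   # max centroid distance for merging neighbouring cells


def local_update(points):
    """Local node: summarise one batch as {cell: [count, coordinate sums]}."""
    summary = {}
    for p in points:
        cell = tuple(math.floor(x / CELL) for x in p)
        entry = summary.setdefault(cell, [0, [0.0] * len(p)])
        entry[0] += 1
        entry[1] = [s + x for s, x in zip(entry[1], p)]
    return summary


def global_merge(grid, summary):
    """Global node: fold a local summary into the global grid (incremental)."""
    for cell, (count, sums) in summary.items():
        entry = grid.setdefault(cell, [0, [0.0] * len(sums)])
        entry[0] += count
        entry[1] = [a + b for a, b in zip(entry[1], sums)]


def centroid(entry):
    """Mean position of the points that fell into a cell."""
    count, sums = entry
    return [s / count for s in sums]


def offline_cluster(grid):
    """Offline part: group adjacent dense cells whose centroids are close."""
    dense = {c for c, e in grid.items() if e[0] >= DENSE_MIN}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            cell = stack.pop()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb not in dense or nb in labels:
                    continue
                gap = math.dist(centroid(grid[cell]), centroid(grid[nb]))
                if gap <= MAX_CENTROID_DIST:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels


# Two batches, as if handled by two local nodes, merged by one global node.
grid = {}
batch_a = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9), (1.1, 0.2)]
batch_b = [(1.3, 0.4), (1.6, 0.7), (5.2, 5.1), (5.5, 5.5), (5.9, 5.3)]
for batch in (batch_a, batch_b):
    global_merge(grid, local_update(batch))
labels = offline_cluster(grid)
```

Keeping only a count and a coordinate sum per cell makes the local summaries additive, so the global merge does not depend on the order in which they arrive. That property is what lets the grid update run in parallel in a sketch like this one; whether DCD-Stream uses the same summary structure is not stated in the abstract.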
【Keywords】data stream clustering; distributed; centroid distance; density grid; Storm 【Authors】Wan Xingui (万新贵); Li Lingjuan (李玲娟); Ma Ke (马可)
【Affiliation】School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China (all three authors) 【Language of Text】Chinese
【Chinese Library Classification】TP311 【Source】
https://www.zhangqiaokeyan.com/academic-journal-cn_computer-technology-development_thesis/0201242836212.html



