Design And Implementation Of Distributed Clustering Framework Based On Model Fusion

Posted on: 2013-09-30  Degree: Master  Type: Thesis
Country: China  Candidate: J Q Li
GTID: 2268330392970761  Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and social networks, the analysis and mining of big data has become a widely recognized problem. Clustering, a classic data mining technique, has to be improved on the foundation of a distributed architecture in order to handle large-scale computation and to adapt to a setting in which data is both massive and dispersed. Distributed clustering has become an active topic in academic research, and improved algorithms continue to appear. Most current distributed clustering algorithms require communication between nodes, which floods the network with redundant data, and they lack a central node to coordinate the overall computation. In the approaches that do establish a central node, that node only relays data and cannot play its full role, because the result is affected by the distribution and quality of the data. This thesis combines the two approaches: it establishes a central node and also describes the data by its density distribution. This reduces the amount of data transmitted over the network and avoids the effect of unevenly distributed data on the clustering algorithm.

Existing distributed clustering algorithms still have several major problems: the distribution of the data, the quality of the data and other factors affect the clustering results; a global description of the data is seriously lacking during the computation; and computation is inefficient because large amounts of redundant data are transmitted. We make improvements on these points.

To address these problems, we first adopt a "one-to-many" mode: one central node is responsible for the overall view and for receiving and forwarding data, while each sub-node is responsible for computing on its own data and reporting the results. This reduces the resources wasted on transferring data between nodes. We then describe the distribution of the data through the data density, thereby reducing the effect of data distribution and data quality on the clustering. In line with the purpose of the algorithm design, we use the classic k-means algorithm as the basic algorithmic framework and adopt Hadoop MapReduce and HDFS to implement iteration, data storage and data transmission.

According to the analysis of the experimental results, the proposed framework reduces computation time and improves clustering accuracy to some extent.
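The "one-to-many" k-means iteration described above can be illustrated with a minimal sketch. It is not the thesis implementation: it simulates the sub-nodes and the central node in a single Python process instead of running on Hadoop, and the function names (local_summary, merge, distributed_kmeans) are hypothetical. The density-based description of the data is not reproduced, because the abstract does not define it. The sketch only shows how each sub-node can report per-cluster sums and counts, rather than raw points, to a central node that updates the centroids.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Summary = List[Tuple[Point, int]]  # per cluster: (coordinate sums, point count)


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def local_summary(points: List[Point], centroids: List[Point]) -> Summary:
    """Sub-node step (map + combine): assign local points, keep sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return [(tuple(s), n) for s, n in zip(sums, counts)]


def merge(summaries: List[Summary], centroids: List[Point]) -> List[Point]:
    """Central-node step (reduce): merge the reported summaries into new centroids."""
    updated = []
    for k, old in enumerate(centroids):
        total = sum(summary[k][1] for summary in summaries)
        if total == 0:
            updated.append(old)  # empty cluster: keep the previous centroid
            continue
        updated.append(tuple(
            sum(summary[k][0][d] for summary in summaries) / total
            for d in range(len(old))
        ))
    return updated


def distributed_kmeans(partitions: List[List[Point]], centroids: List[Point],
                       max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Iterate until the largest squared centroid shift is at most tol."""
    for _ in range(max_iter):
        # Each partition stands in for the data held by one sub-node.
        summaries = [local_summary(part, centroids) for part in partitions]
        updated = merge(summaries, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(updated, centroids)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

In this sketch each sub-node sends k (sum, count) pairs per iteration regardless of how many points it holds, which is one way to realize the reduction in transmitted data that the abstract aims for.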
Keywords/Search Tags: distributed clustering, data density description, Hadoop, cluster analysis, k-means