
Research on Cluster Analysis Based on Open-source Cloud Computing Platform with Hadoop

Posted on: 2016-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: J P Ren
Full Text: PDF
GTID: 2298330452971215
Subject: Software engineering
Abstract/Summary:
With the wide application of data collection tools and the rapid development of the Internet, traditional clustering algorithms struggle to meet the requirements of processing big data. Cloud computing platforms, which evolved from parallel computing, emerged in response. With their distributed and heterogeneous features, cloud computing applications are well suited to large-scale data processing.

Research on improved clustering algorithms mainly applies the concepts of data fields, grids, incremental processing, parallelism, and MapReduce to traditional clustering methods; of these, the MapReduce model is the most widely used to improve clustering efficiency. As data volumes increase, processing large data on cloud computing platforms has become a hot spot, and research on data mining algorithms for these platforms has gradually become a hot topic. Current work mainly includes studying the general rules of parallel algorithms, finding the relationship among data size, algorithm complexity, and the number of nodes, identifying the factors that affect speedup and scalability, and finally designing efficient parallel clustering algorithms.

Three new algorithms based on a cloud computing platform are proposed in this paper.

(1) For processing massive data, a MapReduce-based triangle inequality canopy K-means algorithm is proposed. The algorithm uses the triangle inequality to reduce computational redundancy and running time. The experiments demonstrate that the algorithm reduces I/O and network transmission overhead and overcomes the tendency to fall into a local optimum, so it can effectively process big data on the MapReduce framework.

(2) For processing irregularly distributed massive data, a MapReduce-based hierarchical clustering algorithm is proposed. The algorithm uses Mean Shift to preprocess the massive data and builds on the CURE algorithm to implement a MapReduce-based MS-CURE algorithm. The experiments demonstrate that the algorithm achieves a trade-off between efficiency and timeliness and produces better clustering results.

(3) To address the parameter sensitivity, high time complexity, and restriction to static data of traditional clustering algorithms, a dynamic and incremental clustering algorithm using references and density (DICURD) is proposed. The novelty of DICURD is that it realizes dynamic and incremental clustering on a cloud computing platform. The experimental results demonstrate that the algorithm reduces parameter sensitivity, improves efficiency and resource utilization, and is suitable for analyzing big data.
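The abstract does not spell out how the triangle inequality removes redundant distance computations in algorithm (1). The following single-machine sketch illustrates one common form of the idea, under stated assumptions: if the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be strictly closer, so its distance need not be computed. This is not the thesis's implementation; the canopy step and the MapReduce decomposition are omitted, and the function names `dist` and `assign_with_pruning` are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, skipped), where skipped counts the point-to-center
    distances that were never computed.
    Hypothetical helper for illustration only.
    """
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]

    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = dist(x, centers[0])
        for j in range(1, k):
            # d(c_best, c_j) <= d(x, c_best) + d(x, c_j), so
            # d(c_best, c_j) >= 2 * d(x, c_best) implies
            # d(x, c_j) >= d(x, c_best): c_j cannot be strictly closer.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
    ctrs = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
    labels, skipped = assign_with_pruning(pts, ctrs)
    print(labels, skipped)
```

In a MapReduce setting, a step of this kind would plausibly run inside each mapper, with the center-to-center table shared across mappers; that placement is an assumption here, not something the abstract states.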
Keywords/Search Tags: Big data, Triangle inequality, Mean shift, Dynamic clustering, MapReduce