
Research on Parallel Clustering Algorithms for Large-Scale Data Sets

Posted on: 2017-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: X P Xiao
GTID: 2278330485983949
Subject: Software engineering
Abstract/Summary:
With the growing popularity of networks, applications of all kinds have become commonplace and the volume of data they produce keeps increasing. This data must be analyzed and processed, which makes data mining tasks far more complex. As an unsupervised learning method, clustering algorithms group data according to a similarity measure, helping to extract unknown and valuable information from massive data. Traditional single-machine approaches struggle to meet the demands of massive-data clustering, so distributed computing frameworks have become the new direction for cluster analysis. How to cluster large-scale data sets quickly and efficiently and mine their potentially valuable information has therefore become a research direction of great value.

In recent years, the rapid rise and application of distributed computing, cloud computing, and distributed storage technology have opened new research avenues. Hadoop, an open-source Apache project, uses HDFS (Hadoop Distributed File System) to store large-scale data sets and the MapReduce programming model to parallelize clustering algorithms. Spark is a memory-based parallel computing framework that performs computations efficiently: intermediate clustering results are kept directly in memory, which improves the efficiency of iteration. In this thesis, parallelization is introduced into cluster analysis of large data sets based on the Hadoop and Spark frameworks.
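The MapReduce-parallelized clustering described above can be illustrated with a minimal single-machine sketch of one K-means job: the map phase assigns each point to its nearest center, and the reduce phase averages each group into a new center. This is an illustration of the pattern, not the thesis implementation; function names and the simple list-based "phases" are assumptions.

```python
from collections import defaultdict
import math

def assign(point, centers):
    """Map step: return the index of the nearest center for a point."""
    dists = [math.dist(point, c) for c in centers]
    return min(range(len(centers)), key=dists.__getitem__)

def kmeans_mapreduce(points, centers, iterations=10):
    """One MapReduce round per iteration: map assigns points to centers,
    reduce averages each group into a new center. (A center that attracts
    no points is dropped in this simplified sketch.)"""
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:                                 # map phase
            groups[assign(p, centers)].append(p)
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g))  # reduce phase
            for g in (groups[i] for i in sorted(groups))
        ]
    return centers
```

On a real cluster, each MapReduce round is a full job whose output is written back to HDFS, which is exactly the per-iteration I/O cost that Spark's in-memory caching of intermediate results avoids.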
The details are as follows.

First, according to the characteristics of large-scale data sets, we summarize the relevant clustering techniques for large-scale data sets and their application fields. Building on an analysis of the Hadoop framework and the MapReduce programming model, we study the Spark computing framework in depth, and present the main ideas of the K-means, Canopy, and Particle Swarm Optimization algorithms.

Second, based on a study of the K-means and Canopy algorithms and of the blindness and randomness of their initial-center selection, we propose a parallel algorithm for large data sets, Bisection Canopy-Kmeans (BCK-means). It combines a bisection method with a dynamic iterative initialization principle to determine the initial Canopy centers and the threshold T1, and uses MapReduce to parallelize the algorithm, so that it adapts, to some extent, to the distributed storage environment of large data sets. Experimental analysis shows that the clustering results reflect the inner structure of large-scale data sets while using the available computing and storage capacity as efficiently as possible.

Third, building on the second contribution and exploiting the adaptivity of the basic Particle Swarm Optimization algorithm, we propose a dynamic self-adaptive parallel PSO K-means algorithm (dsPSOK-means) on the Spark platform. The algorithm improves basic PSO with a dynamically adaptive inertia weight in order to reach a global optimum; the output of the dsPSO stage is then used as the input of K-means, raising the intelligence and adaptability of K-means in selecting initial centers. dsPSOK-means is implemented on the Spark in-memory computing framework.
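The Canopy stage that seeds the initial centers can be sketched as follows. This is the standard greedy Canopy procedure with a loose threshold T1 and a tight threshold T2 (T1 > T2), shown as a minimal single-machine illustration; the thesis's bisection-based choice of the center and T1 is not reproduced here.

```python
import math

def canopy(points, t1, t2):
    """Greedy Canopy clustering: take a point as a canopy center, add
    every point within T1 to that canopy, and remove points within the
    tighter threshold T2 from further consideration. Requires t1 > t2.
    Returns a list of (center, members) pairs."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        survivors = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)      # loosely belongs to this canopy
            if d >= t2:
                survivors.append(p)    # still a candidate for new canopies
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

The canopy centers (or canopy means) then serve as the initial K-means centers, replacing blind random initialization.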
Experimental analysis shows that the dsPSOK-means algorithm effectively reduces the communication between nodes during processing and achieves efficient processing capacity.

In short, this study of clustering algorithms for large data sets sheds light on the bottleneck that traditional clustering algorithms face when dealing with huge data. The proposed algorithms improve both the efficiency of processing and the quality of the clustering.
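The dynamically adaptive inertia weight mentioned above can be illustrated with a minimal PSO sketch. The abstract does not give the thesis's exact schedule, so a common choice is assumed here: a weight that decreases linearly from w_max to w_min over the run, favoring exploration early and exploitation late. The particle count, bounds, and acceleration constants c1 = c2 = 2.0 are illustrative defaults, not values from the thesis.

```python
import random

def dynamic_inertia(t, t_max, w_max=0.9, w_min=0.4):
    """An assumed dynamic inertia-weight schedule: linear decrease
    from w_max to w_min over t_max iterations."""
    return w_max - (w_max - w_min) * t / t_max

def pso_minimize(f, dim, n_particles=20, t_max=100, c1=2.0, c2=2.0):
    """Minimal PSO minimizing f over [-10, 10]^dim with the dynamic
    inertia weight; velocities are clamped for stability."""
    rng = random.Random(0)
    xs = [[rng.uniform(-10, 10) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # per-particle best positions
    gbest = min(pbest, key=f)[:]               # global best position
    for t in range(t_max):
        w = dynamic_inertia(t, t_max)
        for i, x in enumerate(xs):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[i][d]
                     + c1 * r1 * (pbest[i][d] - x[d])
                     + c2 * r2 * (gbest[d] - x[d]))
                vs[i][d] = max(-4.0, min(4.0, v))
                x[d] += vs[i][d]
            if f(x) < f(pbest[i]):
                pbest[i] = x[:]
                if f(x) < f(gbest):
                    gbest = x[:]
    return gbest
```

In the dsPSOK-means pipeline, the positions found by this PSO stage would play the role of the initial K-means centers, which is what removes the randomness of plain K-means initialization.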
Keywords/Search Tags: Hadoop, Clustering Analysis, MapReduce, K-means, Spark