
Research On Parallelization Of K-means Algorithm Based On Spark Platform

Posted on: 2020-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Liu
GTID: 2428330599951309
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, massive-data environments have become increasingly common, and how to extract valuable information from big data quickly and effectively has become a research hotspot. For the analysis and computation of massive data, distributed computing frameworks have gradually become the main approach. A distributed computing framework such as Spark can effectively avoid the memory overflow problems of a stand-alone environment and uses cluster resources to improve the scalability and operating efficiency of traditional data mining techniques. Applying these techniques in practice is important for making full use of the information contained in big data.

This paper focuses on the parallel design and implementation of the K-means clustering algorithm and its optimization methods on the Spark platform, and improves performance from the perspective of the algorithm's running efficiency. First, the basic principles of the traditional K-means algorithm and the characteristics of Spark are briefly introduced. Then, based on a thorough study of the Spark programming model and parallel design, the K-means algorithm is improved in two respects: reducing redundant computation and improving sample representativeness. The parallel strategy of the improved algorithm is designed and implemented on the Spark platform. The main contents and innovations of this paper are as follows:

1) To address the large amount of redundant computation in the original K-means algorithm, the principle and limitations of the triangle inequality optimization adopted by the Spark machine learning library are analyzed in detail, and an improved method based on spatial distribution information is proposed. By introducing spatial distribution information, the method quantitatively describes the relationship between data points and cluster centers, so that cluster centers can be filtered during data point assignment. This accelerates assignment and avoids most of the redundant distance calculations in the original algorithm. As a result, the proposed method can fundamentally improve the efficiency of the algorithm.

2) To address the insufficient representativeness of samples drawn by the traditional random sampling strategy, a density-weighted sampling method is proposed. With the new sampling strategy, the entire data set is reflected in the sample to varying degrees, which improves sample quality; the strategy is combined with a pre-clustering method to improve the efficiency of the algorithm.

Based on the above results, the two improved strategies are designed and implemented in parallel on the Spark platform, and the running efficiency, scalability, and clustering quality of the improved algorithm are verified experimentally. The experimental results show that both strategies can significantly improve the efficiency of the algorithm on the Spark platform and exhibit good scalability and speedup in a cluster environment.
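The center filtering described in contribution 1) builds on the triangle inequality, and the basic idea can be illustrated with a small single-machine sketch. The code below is not the thesis's spatial-distribution method and not the Spark machine learning library's implementation; it shows only the standard triangle-inequality filter, under the assumption of Euclidean distance. If the distance between the current best center and a candidate center is at least twice the distance from the point to the current best center, the candidate cannot be closer, so its distance is never computed. The function names `assign_with_pruning` and `assign_brute_force` are hypothetical and chosen for illustration.

```python
import math


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center (Euclidean distance),
    skipping centers ruled out by the triangle inequality.

    Returns (labels, computed), where computed counts the point-to-center
    distance evaluations that were actually performed.
    """
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]

    labels = []
    computed = 0
    for p in points:
        best = 0
        best_d = math.dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best).
            # If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
            # so c_j cannot be strictly closer and is skipped.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = math.dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def assign_brute_force(points, centers):
    """Reference assignment that evaluates every point-to-center distance."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]
```

In a Spark setting, a routine of this kind would plausibly run inside each partition with the centers broadcast to the workers, but that arrangement is an assumption here rather than something the abstract specifies. The savings grow when centers are far apart relative to the spread of points around them, since more candidates satisfy the skip condition.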
Keywords/Search Tags: K-means, triangle inequality, spatial distribution, density weight, Spark