
Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark

Posted on: 2021-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Ren
Full Text: PDF
GTID: 2428330623967321
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of computer software and hardware, computing has entered the big data era. Faced with large-scale data processing tasks, traditional data processing is slow and inefficient, and in some cases cannot complete the task at all. Distributed computing emerged in response. The mainstream distributed computing frameworks today include Spark and Hadoop, and common cluster computing components include the HDFS (Hadoop Distributed File System) distributed file storage system, Spark's distinctive RDD (Resilient Distributed Dataset) structure, the Yarn resource scheduler, and the MapReduce parallel computing framework. Together, these components make parallel computing tasks in the big data era faster and more efficient.

Mining the information hidden in data has important practical value for real-world applications and production. K-Means spatial clustering analysis is an important method in spatial data mining and one of the key research directions in that field. At the same time, traditional data analysis methods cannot run directly in a cluster environment, which is also a hot topic of academic and industrial research in the big data field. In addition, the basic K-Means clustering algorithm chooses its initial cluster centers at random, which makes the clustering results unstable and sensitive to outliers and, in serious cases, can cause the clustering to fail. This thesis therefore first optimizes how the traditional K-Means algorithm initializes its cluster centers, and then designs and implements a parallel execution strategy for the optimized algorithm based on the characteristics of the Spark platform.

The main research contents are as follows:
(1) Improvement of the K-Means algorithm. The traditional K-Means algorithm is very sensitive to noise, which reduces its robustness. This thesis improves the cluster-center initialization step of the traditional algorithm in order to improve its efficiency and robustness.
(2) A study of the two main Hadoop modules (the MapReduce distributed framework and the HDFS distributed storage system), the Spark RDD structure, the Spark SQL module, and Yarn scheduling. This study provides a basis for the parallel design of serial algorithms.
(3) A parallel design of the improved K-Means algorithm implemented on the Spark framework, with resource parameters and IO further optimized according to the characteristics of the Spark platform.
(4) Test experiments designed to verify the effectiveness and efficiency of the research. The experiments are run mainly on a stand-alone machine, the Spark platform, and the Hadoop platform.
(5) Application of the improved, Spark-parallelized K-Means algorithm to the spatial clustering analysis of national air quality, with the final clustering results displayed visually. The resulting division of air quality index levels shows that the research is practical and effective.

The conclusions are as follows:
(1) The silhouette coefficient and CH (Calinski-Harabasz) index of the improved serial K-Means algorithm are better than those of the Spark MLlib algorithm and the original K-Means algorithm.
(2) In cluster mode, the parallel execution speed and speedup of the Spark cluster are better than those of the traditional Hadoop cluster.
(3) Cluster analysis of the national air quality spatial data, compared against known air quality grade data, verifies the practicality and effectiveness of this research in the field of national air quality spatial data mining.
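The abstract does not say how the improved center initialization works or how the Spark job is structured, so the following is only a minimal sketch under stated assumptions. It swaps in k-means++-style seeding as one well-known alternative to purely random initialization, and it uses plain Python lists as stand-ins for RDD partitions so that the per-partition "map" step and the global "reduce" step of one K-Means iteration are visible. All function names (`seed_centers`, `map_partition`, `reduce_partials`, `kmeans`) are hypothetical and are not taken from the thesis or from Spark.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_centers(points, k, rng):
    """k-means++-style seeding (an assumed stand-in, not the thesis's method)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen
        # center, so distant points are more likely to be picked next.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def map_partition(partition, centers):
    """'Map' step: partial coordinate sums and counts per nearest center."""
    dim = len(centers[0])
    partial = {}
    for p in partition:
        idx = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        sums, count = partial.get(idx, ([0.0] * dim, 0))
        partial[idx] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial


def reduce_partials(partials, centers):
    """'Reduce' step: merge partial sums and compute the new centers."""
    dim = len(centers[0])
    totals = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            tsums, tcount = totals.get(idx, ([0.0] * dim, 0))
            totals[idx] = ([a + b for a, b in zip(tsums, sums)], tcount + count)
    # A center that attracted no points keeps its previous position.
    return [
        tuple(s / totals[i][1] for s in totals[i][0]) if i in totals else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, k, n_partitions=4, max_iter=50, seed=0):
    """Iterate map and reduce steps until the centers stop changing."""
    rng = random.Random(seed)
    centers = seed_centers(points, k, rng)
    # Round-robin split of the data; each slice plays the role of a partition.
    partitions = [points[i::n_partitions] for i in range(n_partitions)]
    for _ in range(max_iter):
        partials = [map_partition(part, centers) for part in partitions]
        new_centers = reduce_partials(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

In this sketch only the small per-center sums and counts cross partition boundaries, never the points themselves, which is the usual reason K-Means maps well onto a partitioned data model. On a real cluster the list comprehension over `partitions` would be replaced by the framework's own per-partition and aggregation operations.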
Keywords/Search Tags:parallelization, Spark, big data, K-Means, spatial clustering, air quality analysis