Research On Data Mining Technology Based On Spark

Posted on: 2020-03-20  Degree: Master  Type: Thesis
Country: China  Candidate: P Du  Full Text: PDF
GTID: 2428330578965221  Subject: Computer application technology
Abstract/Summary:
With the growing level of social informatization, a large amount of varied data is generated every day. How to extract valuable information from massive, heterogeneous big data has become an urgent problem, and data mining technology emerged to address it. As one of the important techniques in data mining, cluster analysis divides complex data into clusters according to the similarity of the data in order to analyze it. However, because of the rapid growth of data volume and the inherent limitations of traditional clustering algorithms, those algorithms struggle to meet the demands of mining massive data in terms of both clustering accuracy and processing speed. A common and effective approach is therefore to improve the efficiency and accuracy of the traditional algorithms and to combine the improved algorithms with the distributed processing platform Spark, so that they run in parallel and their processing capability increases substantially. Following this direction, this thesis does the following work.

Firstly, it expounds the research background and significance of the topic, introduces the research status of data mining technology and distributed platforms at home and abroad, and focuses on the theory of cluster analysis and on Spark-related technologies. The classical partition-based algorithm K-means and the density-based clustering algorithm DPC (Density Peak Clustering), which has been highly popular in recent years, are selected as the research objects.

Secondly, the K-means algorithm requires the number of clusters K to be set in advance and selects the initial cluster centers randomly, which leads to an unstable number of iterations and slow convergence. A Holdout-based validation method and the K-means++ method are used to improve K-means so that it adaptively determines the K value and the initial cluster centers. Comparison experiments between the original and the improved K-means on the MovieLens data set verify that the latter effectively improves the accuracy of the algorithm and reduces the time cost.

Thirdly, to address the shortcoming that the clustering effect of DPC relies heavily on the subjective setting of the truncation distance, K-nearest neighbors is combined with DPC and a distance comparison quantity is introduced. The new algorithm, CDPC-KNN, can adaptively generate the truncation distance for arbitrary data sets and makes the computed local density more consistent with the real distribution of the data. Comparison tests on artificial and UCI data sets and a separation test were carried out to verify the feasibility of the improved method.

Fourthly, a Spark cluster environment is built, and the parallel design and implementation of the improved K-means algorithm and the CDPC-KNN algorithm are completed. Parallel-versus-serial comparison experiments verify that the parallelized algorithms greatly improve data processing capability, are more adaptable, and handle large-scale data better.
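As an illustration of the initial-center selection mentioned in the abstract, the following is a minimal single-machine sketch of standard K-means++ seeding, in which each new center is drawn with probability proportional to its squared distance from the nearest center already chosen. It is not the thesis's implementation: the function names `squared_distance` and `kmeans_pp_init`, the `seed` parameter, and the tuple-based point format are assumptions made for this example, and the Holdout-based selection of K and the Spark parallelization are not covered.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers with standard K-means++ seeding.

    Hypothetical helper for illustration only; `points` is a list of tuples.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # nearest_d2[i] is the squared distance from points[i] to its nearest center.
    nearest_d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(nearest_d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            new_center = rng.choice(points)
        else:
            # Points far from all current centers are more likely to be picked.
            new_center = rng.choices(points, weights=nearest_d2, k=1)[0]
        centers.append(new_center)
        nearest_d2 = [
            min(old, squared_distance(p, new_center))
            for old, p in zip(nearest_d2, points)
        ]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_pp_init(data, k=2))
```

Because a point that is already a center has weight zero, it cannot be drawn again while other distinct points remain, which is one reason this seeding tends to spread the initial centers more evenly than purely random selection.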
Keywords/Search Tags: data mining, K-means, DPC, distributed computing, Spark