Optimization And Application Of K-means Clustering Algorithm Based On Spark Framework

Posted on:2021-04-09

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Fu

GTID:2428330647963663

Subject:Computer technology

Abstract/Summary:

In recent years, with the rapid development of the Internet, huge amounts of data of many kinds are generated every day in all sectors of society. This explosive growth has ushered in the era of big data, and how to extract useful information from massive data has become a hot research topic. Data mining technology, which finds valuable content in data, is receiving more and more attention. Cluster analysis, an important part of data mining, has been thoroughly developed and applied in many areas. In the big data era, however, the clustering accuracy and processing efficiency of traditional cluster analysis are increasingly unable to meet the demands of mining massive data. Implementing distributed parallel clustering algorithms on distributed computing frameworks such as Hadoop and Spark gives the clustering algorithm more computing power and better time performance and improves the efficiency of data mining, and this is the trend in data mining research.

In this thesis, the HDFS component of Hadoop is used for data storage because of its high reliability, high fault tolerance and high scalability. The Spark framework, which is based on in-memory computing, is used for data processing because it executes more efficiently than the MapReduce computing framework. The thesis introduces the cuckoo search algorithm, a recently developed swarm intelligence optimization method, because it has few parameters, good global search capability and fast convergence. This algorithm is used to improve the traditional K-means clustering algorithm, and the improved K-means algorithm is then run in parallel experiments on the Spark framework in a distributed cluster environment. The specific work is as follows:

(1) Because the traditional cuckoo search algorithm converges slowly and with poor accuracy in its later stage, the cuckoo search algorithm is improved by introducing an adaptive discovery probability and an adaptive step-generation mechanism, which speed up late-stage convergence and improve convergence accuracy.

(2) Although the original K-means algorithm is simple and has strong local search ability, it easily falls into a local optimum if the initial centroids are poorly chosen. Combining the cuckoo search algorithm with the K-means algorithm yields an optimized K-means algorithm that makes up for this shortcoming, enhances global search ability and produces better clustering results.

(3) In the distributed cluster environment, the improved K-means algorithm proposed in this thesis is compared experimentally, on different data sets, with the original K-means algorithm and with improved K-means algorithms proposed by other scholars in recent years. The results show that the proposed algorithm needs less execution time for clustering than both. Compared with the original K-means algorithm, it also achieves better parallel speedup and scalability and converges faster. In general, the proposed algorithm gives a better clustering effect at a lower time cost than the original K-means algorithm and the improved K-means algorithms proposed by other scholars.

(4) The improved K-means algorithm proposed in this thesis is applied to processing liver function data taken from actual electronic medical records of patients with liver disease. Comparing the clustering accuracy and execution time of the original K-means algorithm, the improved K-means algorithms proposed by other scholars and the algorithm proposed in this thesis on these data shows that, in practical application, the proposed algorithm has better clustering performance and greater practical value than the other K-means algorithms.
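
The idea in items (1) and (2) can be illustrated with a minimal single-machine sketch in plain Python. It is not the thesis's implementation: the abstract does not give the adaptive formulas, so the linear decay of the discovery probability `pa` and the step scale `alpha` is an assumption, as are the function names (`cuckoo_seed`, `kmeans`, `make_blobs`), the greedy replacement rule, the re-sampling of abandoned nests from the data, and the toy data. The Levy step uses Mantegna's algorithm, a common choice in cuckoo search. Spark distribution is not shown.

```python
import math
import random

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(pt, ce)) for ce in centroids)
               for pt in points)

def levy_step(rng, beta=1.5):
    """One heavy-tailed step length drawn with Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = abs(rng.gauss(0, 1)) or 1e-12  # guard against division by zero
    return u / v ** (1 / beta)

def cuckoo_seed(points, k, n_nests=10, iters=40, seed=0):
    """Search for k initial centroids; each nest holds one candidate set."""
    rng = random.Random(seed)
    nests = [[list(c) for c in rng.sample(points, k)] for _ in range(n_nests)]
    fit = [sse(points, nest) for nest in nests]
    for t in range(iters):
        frac = t / max(iters - 1, 1)
        pa = 0.5 - 0.4 * frac     # ASSUMED schedule: discovery probability 0.5 -> 0.1
        alpha = 1.0 - 0.9 * frac  # ASSUMED schedule: step scale 1.0 -> 0.1
        for i in range(n_nests):
            # Levy flight: perturb every coordinate of every centroid in the nest.
            cand = [[x + alpha * levy_step(rng) for x in c] for c in nests[i]]
            f = sse(points, cand)
            if f < fit[i]:  # simplified greedy replacement
                nests[i], fit[i] = cand, f
        best_i = fit.index(min(fit))
        for i in range(n_nests):
            # Abandon discovered nests with probability pa, always keeping the best.
            if i != best_i and rng.random() < pa:
                nests[i] = [list(c) for c in rng.sample(points, k)]
                fit[i] = sse(points, nests[i])
    return nests[fit.index(min(fit))]

def kmeans(points, centroids, max_iter=100):
    """Standard Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for pt in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sum((p - c) ** 2
                                            for p, c in zip(pt, centroids[j])))
            clusters[nearest].append(pt)
        # An empty cluster keeps its previous centroid.
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def make_blobs(seed=0, per_blob=30):
    """Hypothetical toy data: three 2-D Gaussian blobs."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
            for cx, cy in centers for _ in range(per_blob)]

points = make_blobs()
start = cuckoo_seed(points, k=3)
final = kmeans(points, start)
print("SSE after cuckoo seeding:", sse(points, start))
print("SSE after K-means refinement:", sse(points, final))
```

The global search only picks the starting centroids, and K-means then does the local refinement, which matches the division of labour described in item (2). In a distributed version, the per-point nearest-centroid assignment inside `sse` and `kmeans` is typically the step that would be mapped over data partitions.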

Keywords/Search Tags:

K-means clustering, Hadoop, Spark, Swarm intelligence algorithm, Cuckoo search

Related items

1	Improved Parallel K-means Clustering Algorithm Based On Cuckoo Search
2	Research On K-means Method Based On Cuckoo Algorithm
3	Research On Clustering Recommendation Algorithm Based On Cuckoo Search
4	Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment
5	The Research And Implementation Of Multiobjective K-harmonic Means Clustering Algorithm Using Swarm Intelligence
6	Research And Application Of Cuckoo Search Algorithm
7	Study On Cuckoo Search Algorithm For Optimization Problem
8	Research Of Clustering Analysis Based On Swarm Intelligence Optimization Algorithm
9	Research On Parallel Clustering Algorithm For Large-Scale Data Set
10	Research And Implementation Of Density Peaks Clustering Algorithm