With the rapid development of the Internet and AI, huge amounts of data have accumulated online. These data hold considerable commercial value and are increasingly important to enterprises seeking to reduce costs and increase revenue. However, the data have low value density and complex structure, so mining valuable information from them quickly and accurately is a difficult problem. Data mining technology can help extract useful information from such volumes, but traditional data mining techniques suffer from low accuracy and long analysis times when handling massive or high-dimensional data. Moreover, traditional data mining algorithms run on a single computer, so the running time grows as the data volume increases. A feasible solution to these problems is to improve the traditional algorithms, then parallelize the improved clustering algorithm on the distributed computing platform Spark and deploy it on a cluster. This article carries out research in the following aspects:

(1) The K-means algorithm is studied. K-means is sensitive to the initial cluster centers, depends on a K value that is not known in advance, and tends to converge prematurely to a local optimum. A Genetic Algorithm (GA) is used in this paper to address these weaknesses. Because the classical GA is itself prone to premature convergence and may miss the globally optimal solution, it is improved in two respects. First, the GA fitness function is transformed linearly and adjusted in real time as the environment changes, which yields the GA-K-means-L algorithm. Second, the traditional single mutation operator is replaced with a parallel selection mutation operator to avoid falling into a local optimum, which yields the GA-K-means-M algorithm. Finally, the two algorithms are integrated into the GA-K-means algorithm proposed in this paper.

(2) To address the poor performance of the GA-K-means algorithm when processing massive data in a stand-alone environment, a parallel version of GA-K-means is designed and implemented on the cloud computing platform Spark.

(3) A Spark + YARN cluster is built to verify the above improvements. The K-means, GA-K-means-L, GA-K-means-M, and GA-K-means algorithms are deployed in a stand-alone environment and used to cluster data sets of different volumes and dimensions. The average accuracy, average number of iterations, and average running time of each algorithm are then compared statistically. The GA-K-means algorithm is also deployed in the Spark + YARN cluster environment, where data sets of different volumes and dimensions are clustered with different numbers of nodes, and the speedup and scalability of the Spark cluster are analyzed.

The results show that GA-K-means, the hybrid K-means clustering algorithm based on the improved genetic algorithm, effectively overcomes the shortcomings of the traditional K-means algorithm. The parallelized GA-K-means algorithm in the Spark + YARN cluster environment processes large and high-dimensional data efficiently, which has important practical value.
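
The general idea behind point (1), using a GA to search for good initial cluster centers before running K-means, can be illustrated with a minimal single-machine sketch. It is not the GA-K-means algorithm of this paper: the inverse-SSE fitness, tournament selection, one-point crossover, single mutation operator, and all function and parameter names (`ga_kmeans`, `pop_size`, `mutation_rate`, and so on) are assumptions chosen for illustration. The linear fitness scaling of GA-K-means-L and the parallel selection mutation operator of GA-K-means-M are not reproduced here.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(points, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def fitness(points, centroids):
    # Plain inverse-SSE fitness (an assumption, not the paper's scaled fitness).
    return 1.0 / (1.0 + sse(points, centroids))

def lloyd(points, centroids, max_iter=100):
    # Standard K-means refinement starting from the given centroids.
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        new = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids

def ga_kmeans(points, k, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Each individual is a candidate set of k initial centroids drawn from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda ind: fitness(points, ind), reverse=True)
        next_gen = scored[:2]  # elitism: carry over the two best individuals
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(scored, 3), key=lambda ind: fitness(points, ind))
            p2 = max(rng.sample(scored, 3), key=lambda ind: fitness(points, ind))
            # One-point crossover over the list of centroids (builds a new list).
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Single mutation operator: replace one centroid with a random data point.
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.choice(points)
            next_gen.append(child)
        population = next_gen
    best = max(population, key=lambda ind: fitness(points, ind))
    # Refine the best GA individual with ordinary K-means iterations.
    return lloyd(points, best)

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    print(sorted(ga_kmeans(data, k=2)))
```

The sketch recomputes the fitness of every individual against the full data set in each generation, which is the kind of cost that grows quickly with data volume and dimensionality in a stand-alone environment.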