With the development and popularization of computers,massive amounts of data are generated every day.How to mine valuable information from the complicated data in time and quickly is of great significance,which requires the use of data mining technology.This paper introduces the development background,trend and related processing platform of data mining,and elaborates the core concepts of classification and Spark.It also discusses the research status of k-means algorithm and the latest improvement methods,then implements the distributed streaming k-means algorithm based on Spark Streaming.Aiming at the inadequacies of k-means,an improved k-means algorithm based on Markov chain Monte Carlo method and annular region search is designed:AMC2A k-means(Assumption Free Markov Chain Monte Carlo Annular Search k-means).The work done in this paper mainly consists of the following parts:(1)Aiming at the problem that k-means algorithm is sensitive to the initial cluster centers,Markov chain Monte Carlo is used to sample initial cluster centers with a new proposed stationary distribution to simulate D~2-Sampling and proposal distribution to accelerate the sampling process.The Markov chain reaches a stationary distribution after a certain number of state transitions,the sampled data from the stationary distribution acts as a cluster centers using this convergence characteristic.The specific process is:regarding the data points as different states and selecting two points that are far apart as known centers,calculating the distance between the remaining data and the two centers as the proposed distribution.The nearest distance between the data and the existing centers is regarded as a stationary distribution,and then the Markov chain is constructed by accepting-rejecting sampling to realize sampling from data and the selected centers are distributed more dispersedly between each other.This process will allow the initial centers to well characterize the spatial distribution of the data,avoiding resampling in the same category and falling into local optimal solutions and it can also significantly reduce the number of iterations of the k-means algorithm.(2)For the problem that the k-means algorithm has much calculations when finding the nearest center,the annular region search based on triangular inequality is proposed.The k-means algorithm calculates the distance between the data and all centers when searching for point’s category which consumes plenty of resources.For this reason,through modulus of the data points and the triangle decision theorem,the interval where the module of the nearest central point locates can be obtained.Based on interval,an annular filter belt is built which must cover the nearest center.Therefore,in each iteration,k-means only searches the partial categories that meet certain conditions,achieving the role of search filtering and cutting down many unnecessary calculations.This method has lower time complexity when dealing with big data sets and multi-category clustering.(3)Through the simulation comparison experiment based on the Spark Streaming platform,compared with the k-means algorithm and the k-means++algorithm,the improved AMC2Ak-means has better performance in clustering effect,processing time and other indicators.Especially with the increase of data size,it shows excellent processing ability.the processing speed is faster and the system throughput is higher in massive data and multi-category flow clustering. |