Optimization Of K-means Clustering Algorithm And Its Implementation On Spark Streaming

Posted on:2020-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y W Wang

Full Text:PDF

GTID:2518306512957079

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

With the development and popularization of computers,massive amounts of data are generated every day.How to mine valuable information from the complicated data in time and quickly is of great significance,which requires the use of data mining technology.This paper introduces the development background,trend and related processing platform of data mining,and elaborates the core concepts of classification and Spark.It also discusses the research status of k-means algorithm and the latest improvement methods,then implements the distributed streaming k-means algorithm based on Spark Streaming.Aiming at the inadequacies of k-means,an improved k-means algorithm based on Markov chain Monte Carlo method and annular region search is designed:AMC2A k-means(Assumption Free Markov Chain Monte Carlo Annular Search k-means).The work done in this paper mainly consists of the following parts:(1)Aiming at the problem that k-means algorithm is sensitive to the initial cluster centers,Markov chain Monte Carlo is used to sample initial cluster centers with a new proposed stationary distribution to simulate D~2-Sampling and proposal distribution to accelerate the sampling process.The Markov chain reaches a stationary distribution after a certain number of state transitions,the sampled data from the stationary distribution acts as a cluster centers using this convergence characteristic.The specific process is:regarding the data points as different states and selecting two points that are far apart as known centers,calculating the distance between the remaining data and the two centers as the proposed distribution.The nearest distance between the data and the existing centers is regarded as a stationary distribution,and then the Markov chain is constructed by accepting-rejecting sampling to realize sampling from data and the selected centers are distributed more dispersedly between each other.This process will allow the initial centers to well characterize the spatial distribution of the data,avoiding resampling in the same category and falling into local optimal solutions and it can also significantly reduce the number of iterations of the k-means algorithm.(2)For the problem that the k-means algorithm has much calculations when finding the nearest center,the annular region search based on triangular inequality is proposed.The k-means algorithm calculates the distance between the data and all centers when searching for point’s category which consumes plenty of resources.For this reason,through modulus of the data points and the triangle decision theorem,the interval where the module of the nearest central point locates can be obtained.Based on interval,an annular filter belt is built which must cover the nearest center.Therefore,in each iteration,k-means only searches the partial categories that meet certain conditions,achieving the role of search filtering and cutting down many unnecessary calculations.This method has lower time complexity when dealing with big data sets and multi-category clustering.(3)Through the simulation comparison experiment based on the Spark Streaming platform,compared with the k-means algorithm and the k-means++algorithm,the improved AMC2Ak-means has better performance in clustering effect,processing time and other indicators.Especially with the increase of data size,it shows excellent processing ability.the processing speed is faster and the system throughput is higher in massive data and multi-category flow clustering.

Keywords/Search Tags:

K-means, Markov Chain, Monte Carlo Sampling, Triangle Decision, Spark Streaming

PDF Full Text Request

Related items

1	Research On Markov Chain Monte Carlo Method For MIMO Systems
2	Monte Carlo Sampling Methods Based On Dynamics
3	Dimension jumping and auxiliary variable techniques for Markov chain Monte Carlo algorithms
4	Applications Of The Markov Chain Monte Carlo Method In Cryptography
5	The Acceleration And Application Of Markov Chain Monte Carlo Method
6	Bayesian Inference for Complex and Large-Scale Engineering Systems
7	Enhanced Sampling Theory And Methods In Monte Carlo Simulation
8	Study On Nonlocal Image Denoising Method Based On Markov Chain Monte Carlo Sampling
9	Image Denoising Based On Low Complexity Non-Local Means And Kalman Filtering
10	Road Image Segmentation Markov Chain Monte Carlo Method