Research and Application of Clustering Algorithm Based on Big Data

Posted on: 2018-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: L J Wang
GTID: 2348330518997618
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This paper studies the k-means clustering algorithm and its applications. In the big data setting, the limitations of traditional clustering algorithms have become increasingly apparent. Traditional algorithms are efficient and cluster well on small, simple data sets, but on large-scale, high-dimensional data the k-means algorithm is sensitive to the choice of initial centers and to anomalous data, which leads to low efficiency and low accuracy. To address these problems, this paper analyzes and improves the k-means algorithm so that it is more efficient and more accurate on large-scale, high-dimensional data sets.

First, the paper combines kernel principal component analysis with a k-means algorithm based on information entropy. Attributes are screened by their information entropy: those carrying little information, as judged against a specified threshold, are removed to reduce redundancy. Kernel principal component analysis is then applied to the retained attributes to reduce the dimensionality of the data, and k-means is run on the reduced data. This lowers the computational cost of clustering and improves its effectiveness.

Second, the paper addresses the instability caused by choosing the initial cluster centers of k-means at random. The data set is first reduced by simple random sampling to a small sample data set that closely resembles the original data set. The initial cluster centers are then selected using a minimum-variance criterion, which reduces the adverse effect of uncertain factors such as anomalous points on the initial centers.
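The entropy screening, kernel principal component analysis, and k-means steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how attribute entropy is estimated or which kernel is used, so the sketch assumes a histogram-based Shannon entropy and an RBF kernel. All function and parameter names (`attribute_entropy`, `rbf_kernel_pca`, `entropy_kpca_kmeans`, `bins`, `gamma`, `entropy_threshold`) are hypothetical.

```python
import numpy as np

def attribute_entropy(X, bins=10):
    """Shannon entropy (in nats) of each column, estimated from a histogram.
    Assumption: the abstract does not specify the entropy estimator."""
    entropies = []
    for col in X.T:
        counts, _ = np.histogram(col, bins=bins)
        p = counts[counts > 0] / counts.sum()
        entropies.append(-np.sum(p * np.log(p)))
    return np.array(entropies)

def rbf_kernel_pca(X, n_components, gamma):
    """Coordinates of the rows of X on the leading RBF kernel principal components."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the kernel in feature space
    vals, vecs = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]   # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with randomly chosen data points as initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centers.copy()
        for j in range(k):
            if np.any(labels == j):               # keep the old center if a cluster empties
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def entropy_kpca_kmeans(X, k, entropy_threshold, n_components, gamma=1.0):
    """Drop low-entropy attributes, reduce with kernel PCA, then cluster."""
    X = np.asarray(X, dtype=float)
    keep = attribute_entropy(X) >= entropy_threshold
    Z = rbf_kernel_pca(X[:, keep], n_components, gamma)
    labels, _ = kmeans(Z, k)
    return keep, labels
```

Calling `entropy_kpca_kmeans(X, k=2, entropy_threshold=0.5, n_components=1, gamma=0.1)` returns a boolean mask of retained attributes and one cluster label per row. Note that this sketch builds the full n-by-n kernel matrix, so for genuinely large data it would have to run on a sample or use an approximate kernel method.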
In addition, because different attributes of the sample data influence the clustering result to different degrees, the entropy method is used to compute attribute weights and so improve clustering accuracy. On this basis, a weighted k-means algorithm with optimized initial cluster centers is proposed, and its feasibility and validity are verified by numerical experiments.

The paper then applies this weighted k-means algorithm to airline customer segmentation, where numerical experiments further validate its feasibility and effectiveness. Finally, the main work and shortcomings of the paper are summarized, and directions for future research are proposed.
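A weighted k-means of the kind the abstract describes might look like the sketch below. It is an illustration under assumptions, not the thesis's algorithm. `entropy_weights` follows the standard entropy weight method (min-max normalization, column-wise proportions, weight proportional to one minus the normalized entropy). `min_variance_init` is only one plausible reading of the minimum-variance criterion, since the abstract gives no formula: it repeatedly picks the point with the smallest mean squared distance to the remaining points and then discards that point's neighborhood. All function and parameter names are hypothetical.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weight method: attributes that discriminate more get larger weights.
    Assumes at least one attribute is not constant."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    span = X.max(axis=0) - X.min(axis=0)
    R = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)   # min-max normalize
    col_sum = R.sum(axis=0)
    # A constant column carries no information: treat it as uniform (entropy 1).
    P = np.where(col_sum > 0, R / np.where(col_sum > 0, col_sum, 1.0), 1.0 / n)
    safe = np.where(P > 0, P, 1.0)                             # so that 0 * log(0) = 0
    e = -(P * np.log(safe)).sum(axis=0) / np.log(n)
    d = 1.0 - e
    return d / d.sum()

def min_variance_init(X, k, sample_size=None, seed=0):
    """Pick initial centers on a random sample (assumed minimum-variance rule).
    May return fewer than k centers if the sample is exhausted first."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if sample_size is not None and sample_size < len(X):
        X = X[rng.choice(len(X), size=sample_size, replace=False)]
    centers, remaining = [], X
    while len(centers) < k and len(remaining) > 0:
        d2 = ((remaining[:, None, :] - remaining[None, :, :]) ** 2).sum(axis=2)
        i = d2.mean(axis=1).argmin()          # point with the smallest spread around it
        centers.append(remaining[i])
        remaining = remaining[d2[i] > d2[i].mean()]   # drop its neighborhood
    return np.array(centers)

def weighted_kmeans(X, init_centers, w, n_iter=100):
    """Lloyd's iterations with a per-attribute weighted squared Euclidean distance."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        d = (w * (X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

A typical call sequence is `w = entropy_weights(X)`, then `c0 = min_variance_init(X, k, sample_size=m)`, then `labels, centers = weighted_kmeans(X, c0, w)`. Because the initialization is deterministic for a fixed sample, repeated runs give the same clustering, which is the stability property the abstract is after.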
Keywords/Search Tags: large-scale data, dimension reduction, information entropy, kernel principal component analysis, weighted k-means algorithm