Research And Application Of Clustering Algorithm Based On Bigdata

Posted on:2018-03-09

Degree:Master

Type:Thesis

Country:China

Candidate:L J Wang

GTID:2348330518997618

Subject:Probability theory and mathematical statistics

Abstract/Summary:

This paper studies the k-means clustering algorithm and its application. Against the background of big data, the limitations of traditional clustering algorithms have become increasingly obvious. Traditional algorithms are efficient and give good clustering results on small, simple data sets, but on large-scale, high-dimensional data the k-means algorithm is sensitive to the choice of initial centers and to anomalous data, which leads to low efficiency and low accuracy. To address these problems, this paper analyzes and improves the k-means algorithm so that it is more efficient and more accurate on large-scale, high-dimensional data sets.

First, the paper combines kernel principal component analysis with a k-means algorithm based on information entropy. The data attributes are screened according to their information entropy: attributes that carry little information, as judged against a specified threshold, are removed to reduce redundancy. Kernel principal component analysis is then applied to the retained attributes to reduce the dimensionality of the data, and k-means is finally run on the reduced data. This lowers the computational cost of clustering and improves its effectiveness.

Second, the paper addresses the instability caused by choosing the initial cluster centers of k-means at random. The data set is first reduced by simple random sampling to a small sample whose distribution is basically the same as that of the original data set. The initial cluster centers are then selected using a minimum-variance criterion, which reduces the adverse effect of uncertain factors such as anomalous points on the initial centers. In addition, because different attributes of the sample data influence the clustering result to different degrees, the entropy method is used to compute attribute weights and so improve clustering accuracy. The result is a weighted k-means algorithm based on optimized initial cluster centers, whose feasibility and validity are verified by numerical experiments.

Third, the paper applies the weighted k-means algorithm based on optimized initial cluster centers to airline customer segmentation, and further validates its feasibility and effectiveness by numerical experiments. Finally, the main work and shortcomings of the paper are summarized, and directions for future research are put forward.
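
The abstract does not give the formulas for the entropy-based attribute weights or for the weighted distance, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes the commonly used entropy weight method (min-max scaling of each attribute, Shannon entropy of the normalized column, weight proportional to one minus the entropy) and a weighted squared Euclidean distance. The function names `entropy_weights` and `weighted_kmeans` are hypothetical, and the initial centers are simply passed in, because the minimum-variance selection step is not specified in enough detail here to reproduce.

```python
import math


def entropy_weights(rows):
    """Assumed entropy weight method: one weight per attribute, summing to 1."""
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        if hi == lo:
            # A constant attribute cannot separate samples, so it gets zero weight.
            divergences.append(0.0)
            continue
        scaled = [(v - lo) / (hi - lo) for v in col]   # min-max scaling to [0, 1]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        # Shannon entropy normalized to [0, 1]; terms with p == 0 contribute 0.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    total_div = sum(divergences)
    if total_div == 0:
        return [1.0 / m] * m                           # fall back to equal weights
    return [d / total_div for d in divergences]


def weighted_sq_dist(x, c, weights):
    """Weighted squared Euclidean distance between a sample and a center."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, c))


def weighted_kmeans(rows, centers, weights, max_iter=100):
    """Lloyd-style k-means using the weighted distance and caller-supplied centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda k: weighted_sq_dist(r, centers[k], weights))
            for r in rows
        ]
        if new_labels == labels:                       # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:                                # keep old center if cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data (made up for illustration): attribute 0 separates two groups,
# attribute 1 varies widely within each group.
rows = [[1.0, 5.0], [1.2, 1.0], [0.8, 9.0], [9.0, 4.0], [9.2, 0.0], [8.8, 8.0]]
weights = entropy_weights(rows)
labels, centers = weighted_kmeans(rows, [rows[0], rows[3]], weights)
```

In this sketch the weights only rescale each attribute's contribution to the distance; any initial-center strategy, including the sampling-based one described above, can be plugged in through the `centers` argument.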

Keywords/Search Tags:

large-scale data, dimension reduction, information entropy, kernel principal component analysis, weighted k-means algorithm

Related items

1	A Dimension Reduction Method For Large-scale Text Categorization
2	A Dimension Reduction Method For Large-scale Text Categorization
3	Secure And Efficient Dimension-reducing Ranked Query Method For Encrypted Cloud Data
4	A Weighted Kernel PCA And The Related Parameters Choice
5	Research On Feature Extraction Based On Principal Component Analysis
6	Research On Sparse Principal Component Analysis
7	Research On Spectral Reflectance Reconstruction Algorithm Based On Kernel Entropy Component Analysis
8	The Research And Application Of Face Recognition Based On Local Binary Pattern And Principal Component Analysis
9	Novelty Detection Based On Robust Kernel Principal Component Analysis And Kernel Entropy Component Analysis
10	A Study On Kernel-based Classification And Dimension Reduction And Its Application