
Research And Application Of Improved K-means Algorithm In Multivariate Analysis System

Posted on: 2017-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: C H Wu
GTID: 2428330566953020
Subject: Computer science and technology
Abstract/Summary:
Cluster analysis is an important step in data mining. It is an unsupervised data analysis process that uncovers the structure and characteristics of unknown data. With the rapid development of today's information society, data analysis has become an increasingly important guide for production and everyday life. K-means is a traditional partition-based clustering algorithm that has been widely studied and applied because of its simplicity, efficiency, and scalability. It also has several problems. The number of clusters must be specified in advance. The initial cluster centers are selected at random, which makes the results unstable, reduces the efficiency of the algorithm, and tends to yield only a locally optimal result. Furthermore, the traditional K-means algorithm does not adapt well to large volumes of high-dimensional data. This thesis studies these problems and proposes improvements. Its specific contributions are as follows:

1. Because the traditional K-means algorithm cannot determine the number of clusters, this thesis studies the use of cluster validity indices for that purpose. It mainly introduces the DB (Davies-Bouldin) Index, the CH (Calinski-Harabasz) Index, and the XB (Xie-Beni) Index, and repeated experiments show that the DB Index performs well.

2. On the selection of initial cluster centers, this thesis examines the K-means clustering process in detail and finds that the initial centers should be well separated from one another and as close as possible to the actual cluster centers. The thesis partitions the data using a radius and selects the initial cluster centers in order. Repeated experiments show that the improved algorithm clearly improves both clustering effectiveness and efficiency.

3. In the multivariate analysis system, the K-means algorithm is used in the cluster analysis module. This thesis presents a method for calculating the distance between records with mixed attributes. In addition, a single computer has limited computing power and cannot complete cluster analysis on massive high-dimensional data. Because K-means lends itself well to parallel computation, this thesis implements the algorithm on Hadoop. Repeated comparison experiments show that the improved K-means algorithm implemented on Hadoop is more efficient than the traditional K-means algorithm.
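
As an illustration of how a cluster validity index can pick the number of clusters, the following is a minimal sketch, not the thesis's implementation. It runs K-means for each candidate k and keeps the k with the lowest Davies-Bouldin (DB) value. The abstract does not specify the thesis's radius-based seeding rule, so the sketch substitutes farthest-first seeding, which only shares the goal of well-separated initial centers. All function names (`spread_out_seeds`, `kmeans`, `davies_bouldin`, `choose_k`, `demo_points`) and the synthetic data are assumptions made for this example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def spread_out_seeds(points, k):
    """Farthest-first seeding: an assumed stand-in, NOT the thesis's radius rule."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed is the point farthest from its nearest chosen center.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (centers, labels)."""
    centers = spread_out_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments are stable, so we have converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return centers, labels


def davies_bouldin(points, centers, labels):
    """DB index: mean over clusters of the worst (S_i + S_j) / d(c_i, c_j)."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(dist(p, centers[j]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            d = dist(centers[i], centers[j])
            if d == 0:  # coincident centers: treat the partition as invalid
                return math.inf
            worst = max(worst, (scatter[i] + scatter[j]) / d)
        total += worst
    return total / k


def choose_k(points, k_max):
    """Return (best_k, scores) where a lower DB score is better."""
    scores = {}
    for k in range(2, k_max + 1):
        centers, labels = kmeans(points, k)
        scores[k] = davies_bouldin(points, centers, labels)
    return min(scores, key=scores.get), scores


def demo_points(seed=0, per_blob=30):
    """Three tight, well-separated 2-D blobs (synthetic, for illustration only)."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
            for cx, cy in blob_centers for _ in range(per_blob)]


if __name__ == "__main__":
    best_k, scores = choose_k(demo_points(), k_max=6)
    print("best k:", best_k)
    for k, score in sorted(scores.items()):
        print(k, round(score, 3))
```

On the synthetic blobs from `demo_points`, the DB value should be smallest at k = 3, because merging two blobs inflates within-cluster scatter while splitting one blob shrinks the distance between centers. Real data rarely separates this cleanly, so the index is best read as a guide and not as a guarantee.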
Keywords/Search Tags: Cluster analysis, K-means algorithm, initial cluster centers