Research About GS Method Based On The Weighted MP Mahalanobis Distance

Posted on: 2017-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Wang
GTID: 2180330488455285
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Cluster analysis, a branch of multivariate statistical analysis, is widely used in many fields. In cluster analysis, a clustering validity index is the key to evaluating the quality of a clustering, and determining the number of clusters is an important part of that evaluation. In 2000, Tibshirani et al. proposed the "gap statistic" (GS) method for estimating the optimal number of clusters. The method introduces a reference distribution and determines the optimal number of clusters by comparing the within-cluster dispersion of the reference data set with that of the observed data set.
The GS method is based on the k-means clustering algorithm. Because k-means chooses its initial cluster centers at random, its clustering results can be unstable. To address this shortcoming, this paper presents an approach that uses a weight matrix to determine the initial cluster centers.
Although the GS method outperforms many other methods for determining the optimal number of clusters, it applies only to relatively simple data sets. This limitation is closely related to the choice of similarity measure. The GS method uses Euclidean distance by default, which is suitable only when the attributes are independent, is sensitive to the scale (units) of each attribute, and treats all attributes equally, ignoring the differing contributions of the clustering indexes. The classic Mahalanobis distance takes the correlation between variables into account and also standardizes the data so that it is not affected by scale. This paper uses the weighted MP Mahalanobis distance as the similarity measure and proposes the WMPGS model on the basis of the GS model. Simulation experiments on several data sets from the UCI database show that WMPGS is as feasible as GS, that the WMPGap curve more reasonably reflects the characteristics of complex data sets, and that WMPGS achieves better clustering results than GS.
Finally, the paper points out the problems of the method and future research directions.
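As a rough illustration of the ideas summarized above, the following Python sketch computes a gap statistic with a plain k-means and shows one way to obtain a pseudoinverse-based Mahalanobis distance. It is not the WMPGS model itself: it assumes that "MP" refers to the Moore-Penrose pseudoinverse of the sample covariance, it uses random initial centers rather than the weight-matrix initialization, and it omits the weighting scheme, which the abstract does not specify. All function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means with random initial centers; returns labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center as is
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def log_within_dispersion(X, labels):
    """log W_k, with W_k the pooled sum of squared distances to cluster means."""
    w = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
            for j in np.unique(labels))
    return np.log(w)

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Return (estimated k, array of Gap(1..k_max)), uniform-box reference."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = log_within_dispersion(X, kmeans(X, k, seed=seed))
        refs = (rng.uniform(lo, hi, size=X.shape) for _ in range(n_ref))
        ref = np.array([log_within_dispersion(R, kmeans(R, k, seed=seed))
                        for R in refs])
        gaps.append(ref.mean() - log_w)           # Gap(k) = E*[log W_k] - log W_k
        s.append(ref.std() * np.sqrt(1 + 1 / n_ref))
    gaps, s = np.array(gaps), np.array(s)
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

def mp_mahalanobis_transform(X):
    """Map X so that Euclidean distance in the output equals the Mahalanobis
    distance built from the pseudoinverse of the sample covariance (assumption)."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    vals, vecs = np.linalg.eigh(cov)
    keep = vals > 1e-10 * vals.max()  # drop (near-)zero directions
    return (X - X.mean(axis=0)) @ vecs[:, keep] / np.sqrt(vals[keep])
```

Under these assumptions, calling `gap_statistic(mp_mahalanobis_transform(X))` runs the same procedure with the Mahalanobis-type distance in place of the Euclidean one, because k-means on the transformed data measures Euclidean distance in the transformed space.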
Keywords/Search Tags: Cluster analysis, k-means clustering algorithm, measure, GS method, WMPGS method