Research And Implementation On Variable Weighting In K-means Type Clustering

Posted on: 2007-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: X M Li
GTID: 2178360212466979
Subject: Computer Science and Technology
Abstract/Summary:
Many clustering algorithms have been proposed, but the k-means type algorithms are widely used in real-world applications such as marketing research and data mining to cluster very large data sets, owing to their efficiency and their ability to handle the numeric and categorical variables that are ubiquitous in real databases.

However, a major problem in using k-means type algorithms in data mining is the selection of variables. These algorithms cannot select variables automatically because they treat all variables equally in the clustering process. In practice, an interesting clustering structure usually occurs in a subspace defined by a subset of the initially selected variables rather than in the entire variable set, and noise variables hinder cluster discovery. Data in real databases, such as customer databases, are often described by a large number of attributes (variables). Selecting a proper set of variables for clustering from a real-world database is a difficult and important problem in data mining applications because variables do not contribute equally to the discovery of clusters.

Firstly, an automated variable weighting k-means type clustering algorithm (W-k-means) is implemented in this thesis, and an experiment conducted on a synthetic data set is presented. The W-k-means results are compared with those of the standard k-means algorithm without variable weighting and of the k-means algorithm with fixed variable weighting, to verify the good performance of W-k-means in identifying noise variables and discovering clusters. Secondly, to handle categorical variables, a new algorithm called W-k-mode, based on W-k-means and k-mode, is proposed and implemented. Thirdly, to handle both numeric and categorical variables, a new algorithm called W-k-prototypes, based on W-k-means and k-prototypes, is proposed and implemented. Finally, based on the W-k-prototypes algorithm, a clustering analysis system that fits the CRISP (Cross Industry Standard Process for Data Mining) model is implemented.
Keywords/Search Tags: data mining, clustering analysis, variable weighting
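
The abstract does not give the weight update rule used in the thesis, so the following is only a minimal sketch of how automated variable weighting in a k-means type algorithm might look. It assumes the commonly published W-k-means formulation: each variable j accumulates a within-cluster dispersion D_j, and its weight is w_j = 1 / sum_t (D_j / D_t)^(1/(beta-1)) with beta > 1. The function names `update_weights` and `w_k_means`, the parameter `beta`, the squared Euclidean distance, and the toy data are all illustrative assumptions, not the author's implementation.

```python
import random


def update_weights(data, labels, centers, beta=2.0):
    """Recompute variable weights for a fixed partition (assumed rule, beta > 1)."""
    m = len(data[0])
    # D[j]: sum of squared deviations from the cluster centers along variable j.
    D = [0.0] * m
    for x, lab in zip(data, labels):
        for j in range(m):
            D[j] += (x[j] - centers[lab][j]) ** 2
    nonzero = [d for d in D if d > 0]
    weights = [0.0] * m  # assumed convention: a variable with D[j] == 0 gets weight 0
    for j in range(m):
        if D[j] > 0:
            weights[j] = 1.0 / sum((D[j] / d) ** (1.0 / (beta - 1.0)) for d in nonzero)
    return weights


def w_k_means(data, k, beta=2.0, iters=20, seed=0):
    """Alternate assignment, center update, and weight update for a fixed number of rounds."""
    rng = random.Random(seed)
    m = len(data[0])
    centers = [list(x) for x in rng.sample(data, k)]  # random initial centers
    weights = [1.0 / m] * m                           # start with equal weights
    labels = [0] * len(data)
    for _ in range(iters):
        # Assign each object to the nearest center under the weighted distance.
        labels = [
            min(range(k), key=lambda l: sum(
                weights[j] ** beta * (x[j] - centers[l][j]) ** 2 for j in range(m)))
            for x in data
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for l in range(k):
            members = [x for x, lab in zip(data, labels) if lab == l]
            if members:
                centers[l] = [sum(x[j] for x in members) / len(members) for j in range(m)]
        weights = update_weights(data, labels, centers, beta)
    return labels, centers, weights


# Toy data: variable 0 separates two groups, variable 1 behaves like noise.
toy = [[0.0, 0.0], [1.0, 8.0], [10.0, 1.0], [11.0, 9.0]]
toy_weights = update_weights(toy, [0, 0, 1, 1], [[0.5, 4.0], [10.5, 5.0]])
print(toy_weights)  # roughly [0.985, 0.015]
```

In this toy case the noisy variable has a much larger within-cluster dispersion and therefore receives a much smaller weight, which is the behaviour the abstract attributes to W-k-means when it identifies noise variables. A categorical or mixed-type variant would presumably replace the squared difference with a matching-based dissimilarity, but that detail is not stated in the abstract.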