Research On Initial Centers Selection Method For K-modes Clustering

Posted on: 2020-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: K L Wang
GTID: 2428330590452976
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of science and technology, the volume of data has grown explosively. Turning massive data into usable information requires tools that can mine effective information from large amounts of data. Data mining can extract novel, regular information or rules from massive data that are useful for decision making. Cluster analysis is one of the most important data mining tools and is widely used in many industries. The K-modes clustering algorithm is suited to categorical data sets; its idea is easy to understand and simple to implement. In recent years it has become a research hot spot in data mining and scientific decision making. However, the quality of K-modes clustering is particularly sensitive to the choice of initial centers. If the initial centers are selected improperly, various problems are likely to occur and the desired clustering effect will not be achieved. Choosing appropriate initial cluster centers is therefore a key step of the K-modes algorithm. In this thesis, we study the initialization of the K-modes clustering algorithm from the perspectives of improved distance measurement and outlier detection, and we propose an effective initialization mechanism for K-modes clustering. The main research work of this thesis is as follows:

(1) Based on the concepts of knowledge granularity and roughness in rough sets, we propose a new distance metric for categorical data, the weighted overlap distance. When calculating the weighted distance, we assign different weights to different attributes according to the significance of each attribute, and assign lower weights to irrelevant attributes, so as to account for the fact that attributes contribute differently in practical applications. In addition, we apply the weighted overlap distance to the K-modes algorithm and propose KMGRE, a K-modes algorithm based on the new weighted overlap distance. We conducted experiments on UCI data sets, and the experimental results show that the improved K-modes clustering algorithm is superior to the traditional K-modes algorithm.

(2) We propose an outlier detection method based on granular computing and rough sets (GR). The traditional K-modes clustering algorithm is likely to select outliers as initial centers during initialization, which may degrade the quality of clustering, so in this thesis we take the degree of outlierness of objects as a key factor in selecting the initial centers. In view of the problems with current outlier detection methods, the proposed method adopts a granular computing model based on an information table. For any object x ∈ U and a set of indiscernibility relations on U, each indiscernibility relation yields a particular granule g that contains x (g is a subset of objects). To calculate the degree of outlierness of each object x in U (and thus obtain the outliers in U), we first calculate the degree of outlierness of each granule g, and then use the degree of outlierness of the granules to calculate the degree of outlierness of object x.

(3) By combining the weighted overlap distance proposed in (1) with the outlier detection method proposed in (2), we propose Ini_WGROD, a new initial center selection algorithm for K-modes clustering. The initial centers are selected by considering the degree of outlierness of each object together with the weighted overlap distance between the current object and the existing initial centers. In the Ini_WGROD algorithm, objects with a low degree of outlierness are more likely to become initial centers, which avoids selecting outliers as initial centers and hence improves the clustering quality of the K-modes algorithm. In addition, considering the weighted overlap distance between the current object and the existing initial centers avoids drawing multiple initial centers from the same cluster, so the selected initial centers can represent the various clusters with high quality.
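
The three ideas above can be illustrated with a minimal Python sketch. It is not the thesis's KMGRE, GR, or Ini_WGROD algorithm, because the abstract gives none of their formulas. Everything specific here is an assumption made for illustration: the attribute weights are passed in by hand (the thesis derives them from knowledge granularity and roughness), the outlier score is a simple stand-in based on the relative size of the single-attribute granules containing an object, and the selection rule multiplies (1 - outlierness) by the distance to the nearest center already chosen. The function names are hypothetical.

```python
from collections import Counter


def weighted_overlap_distance(x, y, weights):
    """Sum the weights of the attributes on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def granule_outlierness(data):
    """Stand-in outlier score in [0, 1) (assumption, not the GR method).

    For each attribute, the granule of an object is the set of objects
    sharing its value on that attribute. An object lying in small
    granules gets a score close to 1.
    """
    n = len(data)
    m = len(data[0])
    counts = [Counter(row[j] for row in data) for j in range(m)]
    return [
        sum(1 - counts[j][row[j]] / n for j in range(m)) / m
        for row in data
    ]


def select_initial_centers(data, k, weights):
    """Pick k centers that have low outlierness and lie far apart.

    The scoring rule below is an illustrative assumption.
    """
    scores = granule_outlierness(data)
    # First center: the object with the lowest outlier score.
    chosen = [min(range(len(data)), key=lambda i: scores[i])]
    while len(chosen) < k:
        def merit(i):
            nearest = min(
                weighted_overlap_distance(data[i], data[c], weights)
                for c in chosen
            )
            return (1 - scores[i]) * nearest

        candidates = [i for i in range(len(data)) if i not in chosen]
        chosen.append(max(candidates, key=merit))
    return [data[i] for i in chosen]


# Toy categorical data: two groups plus one isolated object.
data = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "flat"),
    ("green", "tiny", "star"),
]
weights = [0.5, 0.3, 0.2]  # hand-picked for illustration only
centers = select_initial_centers(data, 2, weights)
print(centers)
```

On this toy data the isolated object has the highest stand-in outlier score, so it is passed over in favor of one representative from each of the two groups. With real data, both the weights and the outlier scores would come from the rough-set computations described in (1) and (2).
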
Keywords/Search Tags: K-modes clustering, Selection of initial centers, Weighted overlap distance, Outlier detection, Granular computing, Rough sets