
Research on Clustering Algorithms for Categorical Data Based on Rough Sets

Posted on: 2015-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: L L Chu
Full Text: PDF
GTID: 2298330467454734
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, the amount of data stored in electronic form has increased dramatically. To extract more valuable information from these data, data mining has attracted much attention in recent years. Clustering analysis is an important branch of data mining. As an unsupervised knowledge discovery method, clustering has been widely used in many practical applications.

In many practical applications, we need to deal with large quantities of categorical data. Since categorical data do not have the geometric properties of numerical data, traditional clustering algorithms cannot be applied to them directly, so clustering algorithms designed specifically for categorical data are needed. In recent years, research on clustering categorical data has attracted wide attention, and clustering algorithms such as K-modes and Fuzzy K-modes have been proposed. However, current clustering algorithms for categorical data still have many problems. For instance, the distance metrics they adopt are not reasonable, and there is no effective mechanism for choosing the initial centers. To solve these problems, this thesis studies the clustering of categorical data by means of rough set theory. As an effective tool for dealing with uncertain and incomplete data, rough set theory has played an important role in many areas of data mining. Based on the traditional overlap distance, we define a new distance metric using rough set methods.
Moreover, we propose two novel initial center selection algorithms for initializing K-modes clustering.

The work of this thesis consists of the following three parts:

(1) We propose a new distance measure, the weighted overlap distance, and present a new K-modes clustering algorithm, WODKM, based on it. In the WODKM algorithm, we use the concept of attribute significance from rough set theory together with information entropy to calculate the significance of each attribute. When calculating the weighted overlap distance between two objects, each attribute is assigned a weight according to its significance, which reflects the differences between attributes. Experimental results demonstrate the effectiveness of our algorithm.

(2) We apply distance-based outlier detection to K-modes clustering and use it to select the initial cluster centers. To avoid selecting an outlier as an initial center, we introduce the traditional distance-based outlier detection technique into K-modes clustering and propose a new initial center selection algorithm, Ini_Distance. In the Ini_Distance algorithm, we select the initial centers by calculating the degree of outlierness of each object and the weighted distance between objects, so that an object with a low degree of outlierness is more likely to be chosen as an initial center. In addition, by considering the distance between any two initial centers, we avoid the problem of several initial centers coming from the same cluster.

(3) The distance-based outlier detection method still has some problems: for instance, its computational complexity is high and it depends heavily on the distance threshold.
To solve these problems, we further propose a new outlier detection method based on information entropy and use it to choose the initial cluster centers, which yields a new initial center selection algorithm, Ini_Entropy. Compared with the Ini_Distance algorithm, the computational cost of the Ini_Entropy algorithm is relatively low, and the distance threshold does not need to be set in advance.
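
The abstract does not give the formulas behind WODKM, Ini_Distance or Ini_Entropy, so the following Python sketch only illustrates the general ideas of a weighted overlap distance and of outlier-aware center selection. Everything in it is an assumption made for illustration: the function names are hypothetical, the attribute weight is taken to be each attribute's share of the total Shannon entropy (the thesis combines rough set attribute significance with entropy in a way not stated here), and the degree of outlierness is approximated by an object's mean distance to all other objects. It expects at least two objects and k no larger than the number of objects.

```python
import math
from collections import Counter


def attribute_entropy(rows, j):
    """Shannon entropy (base 2) of the values observed in column j."""
    n = len(rows)
    counts = Counter(row[j] for row in rows)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_weights(rows):
    """Assumed weighting: each attribute's entropy as a share of the total."""
    m = len(rows[0])
    entropies = [attribute_entropy(rows, j) for j in range(m)]
    total = sum(entropies)
    if total == 0:  # every column is constant: fall back to equal weights
        return [1.0 / m] * m
    return [e / total for e in entropies]


def weighted_overlap_distance(x, y, weights):
    """Sum the weights of the attributes on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def select_initial_centers(rows, k, weights):
    """Hypothetical rule: favour low outlierness and distance from chosen centers."""
    n = len(rows)
    # Stand-in for the degree of outlierness: mean distance to all other objects.
    outlierness = [
        sum(weighted_overlap_distance(rows[i], rows[j], weights)
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    worst = max(outlierness) or 1.0  # avoid dividing by zero when all rows match
    # First center: the object with the lowest outlierness.
    centers = [min(range(n), key=lambda i: outlierness[i])]
    while len(centers) < k:
        def score(i):
            # Distance to the nearest chosen center, discounted by outlierness.
            spread = min(weighted_overlap_distance(rows[i], rows[c], weights)
                         for c in centers)
            return spread * (1.0 - outlierness[i] / worst)
        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))
    return [rows[i] for i in centers]


rows = [
    ("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
    ("b", "y", "q"), ("b", "y", "q"),
    ("c", "z", "r"),  # an object that shares no value with the others
]
weights = entropy_weights(rows)
print(weights)
print(select_initial_centers(rows, 2, weights))
```

With all weights equal, `weighted_overlap_distance` reduces to a scaled version of the traditional overlap distance, that is, a count of mismatching attributes. Multiplying the spread by a discount that reaches zero for the most outlying object is one simple way to keep an isolated object from being picked as a center; it is not the selection rule used in the thesis.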
Keywords/Search Tags: data mining, rough sets, clustering analysis, outlier detection, weighted overlap distance, information entropy