
Research On Subspace Clustering Based On Attribute Reduction

Posted on: 2019-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2348330566966107
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, massive amounts of data are being produced in a wide range of applications, and extracting useful knowledge from that data has become an urgent requirement. Data mining has therefore become an important research field, and cluster analysis is one of its main research directions. Because cluster analysis is an unsupervised learning method, it is widely applied in practice, e.g., in biological analysis and web log analysis.

However, as the dimensionality of data keeps increasing, the "curse of dimensionality" becomes more and more pronounced, and clustering results on high-dimensional data are often unsatisfactory. This is caused by two major characteristics of high-dimensional data: (1) the data in high-dimensional data sets is usually sparse; (2) the distances between objects in high-dimensional data sets tend to be similar to one another, so traditional distance-based clustering algorithms lose their meaning. These two characteristics make clustering more difficult, and how to cluster high-dimensional data effectively has become one of the main research topics in recent years.

Subspace clustering has been proposed to address these problems. As an effective approach to clustering high-dimensional data, subspace clustering first projects the data from the high-dimensional space into a low-dimensional subspace by means of a feature selection strategy, and then clusters the data in that subspace. However, existing subspace clustering methods still have many problems. For instance, during dimension reduction, the feature selection method adopted cannot effectively preserve the classification ability of the original data, which may bias the clustering results in the subspace. In addition, most existing subspace clustering methods can only handle numerical high-dimensional data; they cannot effectively handle categorical high-dimensional data.

To solve these problems, this thesis applies rough set theory to subspace clustering. First, we propose a rough set attribute reduction algorithm based on granularity decision entropy, called ARGDE, and use it to reduce the dimensionality of high-dimensional data. Second, we propose a K-modes clustering algorithm based on the weighted overlap distance, called WODKM, and use it to cluster data in the low-dimensional subspace, which makes it possible to handle categorical high-dimensional data effectively. Third, we combine ARGDE and WODKM and propose a subspace ensemble clustering algorithm for categorical high-dimensional data, called SPECCH. Finally, experiments are performed on multiple UCI data sets. The experimental results show that the proposed subspace clustering algorithm can solve the clustering problem for categorical high-dimensional data.

The research work of this thesis mainly includes the following three aspects:

(1) An attribute reduction algorithm based on granularity decision entropy is proposed. To solve the problems of existing attribute reduction algorithms based on information entropy, we propose a new information entropy model, granularity decision entropy, and design a new attribute reduction algorithm, ARGDE, based on it. We perform experiments on multiple UCI data sets. Compared with traditional algorithms, the proposed algorithm obtains smaller reducts and higher classification accuracy.

(2) To deal with categorical data, a new distance metric, called the weighted overlap distance, is proposed, together with a K-modes clustering algorithm based on it, called WODKM. In WODKM, we use the concepts of attribute significance and rough entropy from rough set theory to calculate the significance of each attribute. When calculating the weighted overlap distance between two objects, each attribute is given a weight according to its significance, so that the distance reflects the differing importance of the attributes.

(3) A subspace ensemble clustering algorithm for categorical high-dimensional data, called SPECCH, is proposed. In SPECCH, we first use ARGDE to construct multiple feature subspaces. Second, WODKM is used to cluster the data in each constructed feature subspace, generating multiple clustering results. Third, we combine the multiple clustering results via weighted voting. We perform experiments on several UCI data sets. Compared with traditional algorithms, the proposed algorithm obtains better experimental results.
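
The weighted overlap distance described in aspect (2) can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the form below (the sum of the weights of the attributes on which two objects disagree) is an assumption, and the weights are taken as given rather than derived from attribute significance and rough entropy as in the thesis. The function names `weighted_overlap_distance` and `nearest_mode` are hypothetical.

```python
from typing import Hashable, Sequence


def weighted_overlap_distance(
    x: Sequence[Hashable],
    y: Sequence[Hashable],
    weights: Sequence[float],
) -> float:
    """Sum the weights of the attributes on which x and y differ.

    Assumed form only: the thesis computes the weights from attribute
    significance and rough entropy; here they are supplied by the caller.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def nearest_mode(
    obj: Sequence[Hashable],
    modes: Sequence[Sequence[Hashable]],
    weights: Sequence[float],
) -> int:
    """Return the index of the mode closest to obj (K-modes assignment step)."""
    return min(
        range(len(modes)),
        key=lambda i: weighted_overlap_distance(obj, modes[i], weights),
    )


# Illustrative data: three categorical attributes with made-up weights.
weights = (0.5, 0.25, 0.25)
modes = [("red", "small", "round"), ("blue", "large", "square")]
obj = ("red", "large", "square")

print(weighted_overlap_distance(obj, modes[0], weights))  # differs on attributes 2 and 3
print(nearest_mode(obj, modes, weights))
```

With equal weights this reduces to the plain overlap (simple matching) distance used by standard K-modes; unequal weights let a mismatch on a more significant attribute count for more.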
Keywords/Search Tags: rough set, granularity decision entropy, attribute reduction, subspace clustering, high-dimensional data, categorical data