Research And Application Of Rough Clustering Algorithm For High Dimensional Data Sets

Posted on: 2018-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Shao
Full Text: PDF
GTID: 2348330512477256
Subject: Mathematics
Abstract/Summary:
Cluster analysis is one of the important techniques of data mining and can handle numerical, categorical, and mixed data. For numerical data, clustering algorithms have achieved remarkable results. For categorical data, many problems remain open because geometric distance in the traditional sense cannot be computed, for example the design of a reasonable dissimilarity function and the search for an effective clustering initialization mechanism.

In the big data era, high-dimensional massive data sets have dozens, hundreds, or even thousands of attributes, and they are often incomplete, inaccurate, and inconsistent. Traditional clustering algorithms struggle to meet the clustering needs of such data, yet ever richer data carry more valuable information. How to extract useful information from high-dimensional data has become a foremost research topic in cluster analysis, and designing a "distance" measure for high-dimensional data has become a serious task. For high-dimensional clustering, the most common approaches are dimensionality reduction and subspace clustering. Dimensionality reduction is a particularly effective approach to cluster analysis of high-dimensional data; it mainly comprises feature transformation and feature selection, the latter being a common dimensionality reduction technique in data mining.

So far there has been little research on initialization for categorical data. If the initial cluster centers are chosen unreasonably, the algorithm not only fails to reach the best clustering but also becomes more complex. For high-dimensional categorical data in particular, the selection of initial cluster centers is especially important, and there is still no universally accepted selection algorithm for categorical data. It is therefore necessary to propose an initial cluster center selection algorithm for high-dimensional data clustering.

Extended models of the classical rough set can handle incomplete, imprecise, and noisy data sets well, and some good clustering algorithms have been obtained by applying extended rough set methods to incomplete high-dimensional data sets. To address the problems above, this thesis applies an extended rough set model, the limited tolerance relation, to feature selection for high-dimensional incomplete categorical data and designs a clustering algorithm on that basis. The main work consists of two parts:

(1) For high-dimensional incomplete categorical data, the limited tolerance relation is used to extend the rough set model, and the information entropy and conditional entropy are redefined under it. On this basis, the thesis constructs CEHDAR, a dimensionality reduction algorithm based on conditional entropy.

(2) An algorithm, WDADI, based on weighted overlap distance and weighted average density is proposed for selecting initial cluster centers. The algorithm uses the information entropy of the constraint compatibility relation to define the importance of each attribute and, from that, the weight of each attribute. When the distance between objects and the density of objects are computed, each attribute is given its corresponding weight, which reflects the different contributions of different attributes to clustering.

Experiments on data sets from the UCI database show that WDADI is superior to existing clustering initialization methods and that the improved algorithm is effective.
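The abstract does not give formulas, so the following is only a minimal sketch of two ingredients it names: a limited tolerance check between incomplete objects and a weighted overlap distance. It rests on several assumptions that are not from the thesis: missing values are marked with `*`; the relation follows the usual textbook form (two objects are related if both are entirely unknown, or if they share at least one known attribute and agree on every shared known attribute); the weights come from plain Shannon entropy of each attribute as a stand-in for the thesis's entropy under the constraint compatibility relation; and a missing value counts as a mismatch in the distance. The function names are hypothetical.

```python
from collections import Counter
from math import log2

MISSING = "*"  # assumed marker for an unknown attribute value


def limited_tolerance(x, y):
    """Return True if x and y are related under the (textbook) limited tolerance relation."""
    known_x = {i for i, v in enumerate(x) if v != MISSING}
    known_y = {i for i, v in enumerate(y) if v != MISSING}
    if not known_x and not known_y:
        return True  # both objects are entirely unknown
    shared = known_x & known_y
    # They must share at least one known attribute and agree on all shared ones.
    return bool(shared) and all(x[i] == y[i] for i in shared)


def entropy_weights(rows):
    """Stand-in weights: Shannon entropy of each attribute's known values, normalized to sum to 1."""
    n_attrs = len(rows[0])
    entropies = []
    for j in range(n_attrs):
        values = [r[j] for r in rows if r[j] != MISSING]
        total = len(values)
        counts = Counter(values)
        h = -sum(c / total * log2(c / total) for c in counts.values()) if total else 0.0
        entropies.append(h)
    s = sum(entropies)
    if s == 0:
        return [1.0 / n_attrs] * n_attrs  # fall back to uniform weights
    return [h / s for h in entropies]


def weighted_overlap_distance(x, y, weights):
    """Sum the weights of attributes on which x and y differ (missing counts as a mismatch)."""
    return sum(
        w
        for a, b, w in zip(x, y, weights)
        if a == MISSING or b == MISSING or a != b
    )


rows = [
    ("red", "s", "*"),
    ("red", "m", "x"),
    ("blue", "m", "x"),
    ("*", "*", "*"),
]
weights = entropy_weights(rows)
d = weighted_overlap_distance(rows[1], rows[2], weights)
```

With weights of this kind, attributes whose values vary more across objects contribute more to the distance; the thesis's own weighting and density definitions may differ from this sketch.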
Keywords/Search Tags:Information Entropy, Weighted Overlap Distance, Initial Clustering Centers, High-dimensional Categorical Incomplete Data, Clustering Analysis