
Research On Clustering Algorithms For High-Dimensional Data

Posted on: 2019-08-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q M Lu
Full Text: PDF
GTID: 1368330596956547
Subject: Signal and Information Processing
Abstract/Summary:
As an important branch of data mining, data clustering has been widely studied and developed. In contrast to supervised classification, cluster analysis uses no prior information and belongs to unsupervised learning, which increases its difficulty. Cluster analysis usually computes the correlations between samples according to a given similarity measure and then divides the data into different partitions. It can not only explore the intrinsic relationships between samples but also reveal the distribution characteristics of the raw data; in addition, it can serve as a pre-processing step for data analysis. Over the past decades, many clustering algorithms have been proposed and successfully applied in fields such as machine learning, computer vision, pattern recognition, data compression, and image processing.

However, with the development of information technology, the dimensionality of data has increased dramatically, and traditional clustering algorithms face the following challenges: 1) high-dimensional data contain a large amount of redundant and irrelevant information, and the discriminative information may lie in a feature subset; 2) high-dimensional data are usually drawn from multiple low-dimensional subspaces, so traditional distance measures are no longer suitable; 3) traditional clustering algorithms have poor stability, especially on high-dimensional data. Research on clustering algorithms for high-dimensional data is therefore a significant and challenging topic.

In recent years, scholars have studied high-dimensional data clustering from three directions: feature selection, subspace clustering, and cluster ensemble. Several problems remain: 1) research on exploring the correlations between features is insufficient; 2) existing subspace clustering methods ignore the effects of noise and of the structural characteristics of the representation matrix; 3) existing cluster ensemble methods ignore the structural and discriminative information between the clusters in the initial clustering results. To address these problems, this dissertation studies clustering algorithms for high-dimensional data from four aspects. The main research contents and contributions can be summarized as follows:

(1) Structure-preserving unsupervised feature selection. A self-expression model is utilized to explore the relationships between features, so the proposed method performs feature selection without learning a pseudo cluster-indicator matrix, which avoids introducing noise into the feature selection process. In addition, a structure-preserving constraint is incorporated into the model to capture the local manifold structure of the raw data. An iterative optimization algorithm is then employed to solve the model, with a theoretical analysis of its convergence.

(2) Subspace clustering with the Cauchy loss function. Real data are always contaminated by noise, and the noise usually follows a complex statistical distribution. Without a proper model for the noise, the learned representation matrix may fail to capture the similarity between samples, which degrades subspace clustering performance. To solve this problem, the Cauchy loss function is used to constrain the noise term. Because the influence function of the Cauchy loss is bounded, it alleviates the influence of any single sample, especially one with large noise, on the residual estimate.

(3) Subspace clustering based on block-diagonal structure. Ideally, a representation matrix should be block-diagonal, meaning that intra-class points are correlated while the similarity between inter-class points is zero. A Laplacian rank constraint is therefore utilized to learn a representation matrix with block-diagonal structure. Furthermore, since the elements of the representation matrix represent the correlations between data points, a non-negativity constraint is added to the problem. The objective function can then be formulated as a special non-negative matrix factorization problem, which is solved by a multiplicative update method.

(4) Set-covering-based structural cluster ensemble. The proposed method first formulates cluster ensemble as a set-covering problem. A Laplacian regularization term is then added to the set-covering problem to capture the structural information between clusters. Moreover, to exploit the discriminative information lying in the initial clustering results, a discriminative constraint is incorporated into the model. The proposed method is thus capable of obtaining final clusters with a high degree of dispersion.
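The self-expression idea behind contribution (1) can be sketched as follows: each feature is reconstructed as a linear combination of all features, and features are ranked by the row norms of the learned coefficient matrix. This toy version uses a plain ridge penalty with a closed-form solution; the dissertation's actual model adds a sparsity-inducing norm and the structure-preserving constraint, which are omitted here, and all variable names are illustrative.

```python
import numpy as np

def self_expression_feature_scores(X, lam=1.0):
    # X: (n_samples, n_features). Reconstruct every feature from all features:
    #   min_W ||X - X W||_F^2 + lam * ||W||_F^2
    # (ridge stand-in for the dissertation's sparsity/structure-regularised model).
    d = X.shape[1]
    G = X.T @ X
    W = np.linalg.solve(G + lam * np.eye(d), G)  # closed-form ridge solution
    return np.linalg.norm(W, axis=1)             # row norm scores each feature

# Toy data: 3 informative features, 2 exact linear combinations of them,
# and 2 near-zero noise features that should receive low scores.
rng = np.random.default_rng(0)
informative = rng.normal(size=(100, 3))
derived = informative @ rng.normal(size=(3, 2))
noise = rng.normal(size=(100, 2)) * 0.01
X = np.hstack([informative, derived, noise])
scores = self_expression_feature_scores(X)
```

Features that participate in the self-expression of others get large row norms, so selecting the top-scoring features keeps the correlated, informative subset.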
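The robustness argument in contribution (2) rests on the Cauchy loss having a bounded influence function, which in an iteratively reweighted least-squares (IRLS) view means large residuals receive vanishing weights. The sketch below only demonstrates that weighting effect on a robust location estimate; it is not the dissertation's subspace clustering formulation, and the scale parameter `c` and the IRLS demo are illustrative assumptions.

```python
import numpy as np

def cauchy_weight(r, c=1.0):
    # IRLS weight induced by the Cauchy loss rho(r) = (c^2/2) * log(1 + (r/c)^2).
    # Its influence function psi(r) = r / (1 + (r/c)^2) is bounded, so a single
    # large-noise sample cannot dominate the residual estimate.
    return 1.0 / (1.0 + (np.asarray(r) / c) ** 2)

residuals = np.array([0.1, 1.0, 10.0, 100.0])
weights = cauchy_weight(residuals)  # weights shrink rapidly with the residual

# IRLS demo: a Cauchy-weighted mean ignores a gross outlier.
data = np.array([0.0, 0.1, -0.1, 50.0])
mu = data.mean()                    # plain mean is dragged toward 50
for _ in range(50):
    w = cauchy_weight(data - mu)
    mu = (w * data).sum() / w.sum() # converges near the inlier centre
```

Under a squared loss the weights would all equal 1 and the single outlier would shift the estimate; the bounded influence of the Cauchy loss is what suppresses it.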
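The Laplacian rank constraint in contribution (3) relies on a standard spectral graph fact: for a non-negative symmetric affinity matrix, the multiplicity of the zero eigenvalue of the graph Laplacian equals the number of connected components, i.e. the number of diagonal blocks. The check below illustrates that fact on a toy affinity; the optimization that enforces the rank constraint is not reproduced here.

```python
import numpy as np

def num_blocks(A, tol=1e-8):
    # Count zero eigenvalues of the graph Laplacian L = D - A.
    # For a block-diagonal affinity this equals the number of blocks,
    # which is what a Laplacian rank constraint pins to the cluster count k.
    A = (A + A.T) / 2.0
    L = np.diag(A.sum(axis=1)) - A
    eigvals = np.linalg.eigvalsh(L)
    return int(np.sum(eigvals < tol))

# Affinity with two diagonal blocks: {0,1} and {2,3}.
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = 1.0
A[2, 3] = A[3, 2] = 1.0

# Adding a cross-block edge merges the blocks into one component.
A_merged = A.copy()
A_merged[0, 2] = A_merged[2, 0] = 0.5
```

Constraining rank(L) = n - k therefore forces the learned representation matrix to have exactly k blocks, directly encoding the desired cluster structure.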
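The set-covering formulation in contribution (4) can be made concrete with the classic greedy approximation: from the pool of clusters produced by the base clusterings, repeatedly pick the cluster covering the most still-uncovered samples. The dissertation adds a Laplacian regularizer and a discriminative constraint on top of this formulation; both are omitted in this bare sketch, and the toy clusters are invented for illustration.

```python
def greedy_set_cover(universe, candidate_clusters):
    # Greedy set covering: each iteration selects the candidate cluster with
    # the largest number of not-yet-covered samples, until all samples from
    # the universe are covered (or no cluster adds coverage).
    uncovered = set(universe)
    chosen = []
    while uncovered:
        gains = [len(c & uncovered) for c in candidate_clusters]
        best = max(range(len(candidate_clusters)), key=gains.__getitem__)
        if gains[best] == 0:
            break  # remaining samples appear in no candidate cluster
        chosen.append(best)
        uncovered -= candidate_clusters[best]
    return chosen

# Clusters pooled from several base clusterings of samples 0..5 (toy example).
clusters = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
selected = greedy_set_cover(range(6), clusters)
```

Here the greedy rule keeps clusters 0 and 2, which together cover all six samples; the ensemble method then refines such selections using structural and discriminative information.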
Keywords/Search Tags: High-dimensional data, Cluster Analysis, Feature Selection, Subspace Clustering, Cluster Ensemble