
Unsupervised Clustering Algorithm Based On Dimension Reduction

Posted on: 2018-06-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Q Du
Full Text: PDF
GTID: 1318330533957030
Subject: Physics
Abstract/Summary:
In recent years, with the continuous improvement of data acquisition capacity and the rapid development of computing, the volume and dimensionality of data have grown rapidly. How to effectively analyze such massive high-dimensional data and extract the underlying regularities is one of the central challenges of machine learning. In the absence of label information, performing dimensionality reduction and clustering of high-dimensional data simultaneously, in order to mine the internal structure of the data, is a current difficulty and hotspot in machine learning. This dissertation focuses on representing the original high-dimensional data in a low-dimensional space via matrix factorization without label information, and simultaneously clustering the low-dimensional representation to discover the intrinsic category structure. The main contributions and innovations of this work are as follows:

1. Most current regression-based feature selection methods use pseudo cluster labels as the target matrix for selecting discriminative features. Since such a target matrix is by definition a 0-1 matrix, optimizing those models is NP-hard. To handle this effectively, a novel method, robust unsupervised feature selection via matrix factorization (RUFSM), is proposed. Its advantages are three-fold. First, the target matrix is decomposed into two factors, an orthogonal cluster-center matrix and a low-dimensional sparse representation matrix; this not only makes the model easy to solve iteratively, but also enables the feature selection (projection) matrix to better select discriminative features. Second, the orthogonality constraint on the cluster centers and the sparsity constraint on the low-dimensional representation ensure that projected samples from different classes are far apart while samples from the same class stay close to each other. Finally, the l2,1-norm is adopted as the loss function
for filtering out noisy samples and outliers, and robust feature selection and robust clustering are performed simultaneously, which helps the algorithm attain an overall optimal solution. Moreover, both the latent orthogonal cluster centers and the sparse representation of the projected data points are predicted via matrix factorization to select robust discriminative features. Compared with several state-of-the-art unsupervised feature selection methods, the proposed algorithm achieves better clustering performance on almost all datasets in our experiments.

2. To deal with the non-differentiable nuclear norm in the objective function of low-rank representation, we propose an algorithm called nonnegative graph-regularized low-rank concept factorization (GLCF). First, using matrix theory, the low-rank constraint, which preserves the global structure of the data, is transformed into minimizing the sum of the Frobenius norms of the two factors; taking into account the semantic meaning of nonnegativity in clustering analysis, the factors are constrained to be nonnegative, and a manifold regularization term lets the low-dimensional representation preserve the local geometric structure of the original samples. Second, a multi-step iterative updating optimization scheme is proposed and the convergence of the algorithm is proved theoretically. Third, the connection between the iterative updating rules and gradient descent is analyzed, and iterative updating rules for data matrices with negative entries are proposed. Experimental results demonstrate better clustering performance than related matrix factorization algorithms based on nonnegativity constraints.

3. Current LRR-based approaches have the following drawbacks: 1) the original data, which serve as the dictionary in LRR for seeking the lowest-rank representations of the data points, usually contain noise and may not be representative; and 2) the
affinity matrix and subspace clustering are obtained in two independent steps. We therefore propose an improved LRR-based approach, called graph-regularized compact LRR (GCLRR). First, to eliminate the effect of noisy samples in the dictionary on the low-rank representation, a linear combination of the original data is used as the dictionary; whenever the combination coefficients are updated, the dictionary is updated accordingly. Second, the orthogonal combination-coefficient matrix together with the low-dimensional low-rank representation matrix can be viewed as a decomposition of the low-rank representation matrix in standard LRR, so the low-rank representation can be used directly for clustering. Third, the low-dimensional representation used for subspace clustering captures both the global subspace structure (through low-rankness) and the local manifold structure (through manifold regularization) of the original data. Extensive experiments on both synthetic and real datasets verify the effectiveness and robustness of the proposed method and demonstrate higher clustering capacity than state-of-the-art LRR-based clustering algorithms.
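The l2,1-norm that underlies the robust loss and feature-ranking step of the first contribution can be sketched in a few lines of NumPy. This is a minimal illustration of the norm and of ranking features by projection-matrix row norms, not the RUFSM solver itself; the matrix `W` and the choice of two selected features are hypothetical:

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the Euclidean norms of the rows of M.
    # Minimizing it encourages entire rows to shrink toward zero.
    return np.sqrt((M ** 2).sum(axis=1)).sum()

# Ranking features by the row norms of a (hypothetical) projection matrix W:
# rows with large l2-norm correspond to more discriminative features.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))          # 5 features projected to 3 dimensions
scores = np.sqrt((W ** 2).sum(axis=1))   # per-feature row norms
selected = np.argsort(scores)[::-1][:2]  # keep the 2 highest-scoring features
```

Used as a loss, the l2,1-norm penalizes each sample's residual by its plain l2-norm rather than its square, which is why outlying samples exert less influence on the fit.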
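The second contribution combines nonnegative factors with a graph-Laplacian smoothness term. The GLCF updates themselves are not reproduced in this abstract, so the following is only a sketch of the general recipe in the style of graph-regularized NMF multiplicative updates; the factorization X ≈ U Vᵀ, the parameter names, and the update rules are assumptions for illustration, not the dissertation's actual algorithm:

```python
import numpy as np

def graph_regularized_nmf(X, A, k, lam=0.1, iters=200, seed=0):
    """Sketch of graph-regularized NMF: X ≈ U @ V.T with U, V >= 0
    and a Tr(V^T L V) smoothness term, where L = D - A is the graph
    Laplacian of the (symmetric, nonnegative) sample affinity matrix A."""
    rng = np.random.default_rng(seed)
    n_feat, n_samp = X.shape
    U = rng.random((n_feat, k)) + 1e-3   # nonnegative initializations
    V = rng.random((n_samp, k)) + 1e-3
    D = np.diag(A.sum(axis=1))           # degree matrix of the graph
    eps = 1e-9                           # avoid division by zero
    for _ in range(iters):
        # Multiplicative updates keep the factors nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * A @ V) / (V @ (U.T @ U) + lam * D @ V + eps)
    return U, V
```

The manifold term pulls the low-dimensional rows of `V` for strongly connected samples toward each other, which is how the local geometric structure of the original samples survives the factorization.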
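LRR-type objectives like the one GCLRR improves on are usually solved by handling the nuclear norm through its proximal operator, singular value thresholding, and then building a symmetric affinity matrix from the learned representation for spectral clustering. The sketch below shows only these two generic building blocks, not GCLRR itself; the function names are illustrative:

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: the proximal operator of tau * ||.||_*,
    # the standard inner step of nuclear-norm (low-rank) solvers.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def affinity(Z):
    # A common way to turn a low-rank representation Z into a symmetric
    # affinity matrix for spectral clustering.
    return (np.abs(Z) + np.abs(Z).T) / 2.0
```

Shrinking the singular values rather than the entries is what favors globally low-rank solutions, in contrast to the entrywise sparsity promoted by l1-style penalties.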
Keywords/Search Tags: clustering analysis, dimension reduction, matrix factorization, l2,1-norm, feature selection, feature extraction, subspace clustering, low-rank representation