
Feature Selection And Clustering For High-dimensional Data

Posted on: 2016-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y He
GTID: 2308330461967823
Subject: Computer application technology
Abstract
With the rapid development of information technology, the amount of data is growing exponentially and data dimensionality keeps increasing. These trends make information more abundant, but they also pose challenges to data mining algorithms, particularly classification and clustering. High dimensionality not only increases an algorithm's time and space complexity, but also reduces its accuracy, which has a negative impact on analysis and decision making. Dimensionality reduction, one of the hottest topics in machine learning and data mining, can address this problem efficiently. Its main purpose is to make an algorithm consider only the attributes relevant to the task during model construction, thereby reducing the time and space complexity of algorithms and improving their efficiency.

Different data distributions call for different dimensionality reduction methods. When high-dimensional data lie in a single low-dimensional space, feature selection is commonly used; when high-dimensional data lie in a union of several low-dimensional subspaces, subspace clustering is commonly used. Existing algorithms of both kinds have their own shortcomings: for feature selection, classification accuracy on the selected feature subset is not high enough; for subspace clustering, clustering accuracy is not high enough. We study feature selection and subspace clustering from the following two aspects.

1. When high-dimensional data lie in a single low-dimensional space, many feature selection algorithms based on mutual information have been proposed, such as MIFS, mRMR, JMI, and CMIM. These algorithms evaluate a candidate feature against the already selected features rather than against all features, which can leave classification accuracy on the selected feature subset unsatisfactory. To solve this problem, we propose a feature selection method based on PageRank and a genetic algorithm (FSCN). We regard each feature as a network node, create edges between pairs of nodes weighted by mutual information, evaluate the nodes with an improved PageRank, rank them according to the criterion defined in this thesis, and select the optimal feature subset with a genetic algorithm. Experimental results on ten benchmark datasets from UCI demonstrate that the proposed FSCN selects a better feature subset. A sketch of the graph-ranking step is given below.
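The following is a minimal, illustrative sketch of the idea of ranking features on a mutual-information graph, assuming discrete-valued features. The function name rank_features_by_pagerank is hypothetical, standard weighted PageRank stands in for the thesis's improved PageRank, and the genetic-algorithm subset search is omitted.

import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def rank_features_by_pagerank(X, n_selected):
    """Rank features of a discrete-valued data matrix X (n_samples, n_features)."""
    n_features = X.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(n_features))
    # Connect pairs of features; edge weight = mutual information between them.
    for i in range(n_features):
        for j in range(i + 1, n_features):
            mi = mutual_info_score(X[:, i], X[:, j])
            if mi > 0:
                G.add_edge(i, j, weight=mi)
    # Weighted PageRank as a stand-in for the thesis's improved
    # PageRank node-importance estimate.
    scores = nx.pagerank(G, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_selected]

In the method described above, the ranked list would then be fed to a genetic algorithm that searches for the best-performing feature subset.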
2. When high-dimensional data lie in a union of several low-dimensional subspaces, the LS3C/NLS3C algorithm is currently among the most effective subspace clustering methods. However, in the coefficient matrix that LS3C/NLS3C obtains in the sparse coding stage, some coefficients linking data objects from different subspaces are nonzero; we call these bad coefficients in this thesis. Bad coefficients cause some similarities between data objects from different subspaces to be nonzero in the similarity matrix built from the coefficient matrix; we call these bad similarities. Bad similarities reduce the clustering accuracy of LS3C/NLS3C. To solve this problem, we propose denucleation latent space sparse subspace clustering (DLS3C/DNLS3C).

We reduce the bad coefficients in the sparse coding stage by adding a Frobenius norm term to the optimization objective. The Frobenius norm term keeps all coefficients in a linear representation from approaching zero simultaneously. If the data are drawn from a union of affine subspaces, our method forces the coefficients of data points that are more similar to the target point to stay away from zero and the coefficients of data points that are less similar to come close to zero; we then remove the coefficients with small absolute values, obtaining a better coefficient matrix that improves accuracy in the spectral clustering stage. Experiments on Hopkins155, a standard subspace clustering benchmark, show that DLS3C reduces bad coefficients and improves clustering accuracy. A sketch of this pipeline is given below.
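The following is a rough, hypothetical sketch of the "shrink, denucleate, cluster" pipeline. It uses a plain SSC-style self-representation with an elastic-net penalty, whose l2 part plays the role of the Frobenius norm term, as a stand-in for the modified LS3C/NLS3C objective (which additionally learns a latent space). The function name, the threshold value, and all solver settings are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cluster import SpectralClustering

def denucleated_subspace_clustering(X, n_clusters, threshold=1e-3):
    """X: data matrix of shape (n_features, n_samples). Returns cluster labels."""
    n = X.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        # Represent column i by the other columns; the l1 part gives sparsity,
        # the l2 part is the Frobenius-type term that keeps the coefficients
        # of similar points away from zero.
        mask = np.arange(n) != i
        model = ElasticNet(alpha=0.1, l1_ratio=0.7,
                           fit_intercept=False, max_iter=5000)
        model.fit(X[:, mask], X[:, i])
        C[mask, i] = model.coef_
    # "Denucleation": drop coefficients with small absolute value
    # (the bad coefficients, in the thesis's terminology).
    C[np.abs(C) < threshold] = 0.0
    # Symmetric affinity from coefficients, then spectral clustering.
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels

Zeroing out the small-magnitude coefficients before building the affinity matrix is the denucleation step: it suppresses the bad similarities between points from different subspaces before spectral clustering runs.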
Keywords: Feature selection, Mutual information, PageRank, Node importance estimation, High-dimensional data, Subspace clustering, Sparse representation, Spectral clustering, Motion segmentation