
Feature Selection And Clustering For High-dimensional Data

Posted on: 2016-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y He
GTID: 2308330461967823
Subject: Computer application technology
Abstract
With the rapid development of information technology, the amount of data is growing exponentially and data dimensionality keeps increasing. These trends make information more abundant, but they also pose challenges to data mining algorithms, particularly classification and clustering. High dimensionality not only increases an algorithm's time and space complexity, but also reduces its accuracy, which has a negative impact on analysis and decision making. Dimensionality reduction, one of the hottest topics in machine learning and data mining, can address this problem efficiently. Its main purpose is to make an algorithm consider only the attributes relevant to the task during model construction, thereby reducing the time and space complexity of algorithms and improving their efficiency.

Different data distributions call for different dimensionality reduction methods. When high-dimensional data lie in a single low-dimensional space, feature selection is commonly used; when high-dimensional data lie in a union of several low-dimensional subspaces, subspace clustering is commonly used. Existing algorithms of both kinds have their own shortcomings: for feature selection, classification accuracy on the selected feature subset is not high enough; for subspace clustering, clustering accuracy is not high enough. We study feature selection and subspace clustering from the following two aspects.

1. When high-dimensional data lie in a single low-dimensional space, many feature selection algorithms based on mutual information have been proposed, such as MIFS, mRMR, JMI, and CMIM. These algorithms evaluate a candidate feature against the already selected features rather than against all features, which can leave classification accuracy on the selected feature subset unsatisfactory. To solve this problem, we propose a feature selection method based on PageRank and a genetic algorithm (FSCN). We regard each feature as a network node, create edges between pairs of nodes weighted by mutual information, evaluate the nodes with an improved PageRank, rank them according to the criterion defined in this thesis, and select the optimal feature subset with a genetic algorithm. Experimental results on ten benchmark datasets from UCI demonstrate that the proposed FSCN selects a better feature subset. A sketch of the graph-ranking step is given below.
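The following is a minimal, illustrative sketch of the idea of ranking features on a mutual-information graph, assuming discrete-valued features. The function name rank_features_by_pagerank is hypothetical, standard weighted PageRank stands in for the thesis's improved PageRank, and the genetic-algorithm subset search is omitted.

import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def rank_features_by_pagerank(X, n_selected):
    """Rank features of a discrete-valued data matrix X (n_samples, n_features)."""
    n_features = X.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(n_features))
    # Connect pairs of features; edge weight = mutual information between them.
    for i in range(n_features):
        for j in range(i + 1, n_features):
            mi = mutual_info_score(X[:, i], X[:, j])
            if mi > 0:
                G.add_edge(i, j, weight=mi)
    # Weighted PageRank as a stand-in for the thesis's improved
    # PageRank node-importance estimate.
    scores = nx.pagerank(G, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_selected]

In the method described above, the ranked list would then be fed to a genetic algorithm that searches for the best-performing feature subset.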
2. When high-dimensional data lie in a union of several low-dimensional subspaces, the LS3C/NLS3C algorithm is currently among the most effective subspace clustering methods. However, in the coefficient matrix that LS3C/NLS3C obtains in the sparse coding stage, some coefficients linking data objects from different subspaces are nonzero; we call these bad coefficients in this thesis. Bad coefficients cause some similarities between data objects from different subspaces to be nonzero in the similarity matrix built from the coefficient matrix; we call these bad similarities. Bad similarities reduce the clustering accuracy of LS3C/NLS3C. To solve this problem, we propose denucleation latent space sparse subspace clustering (DLS3C/DNLS3C).

We reduce the bad coefficients in the sparse coding stage by adding a Frobenius norm term to the optimization objective. The Frobenius norm term keeps all coefficients in a linear representation from approaching zero simultaneously. If the data are drawn from a union of affine subspaces, our method forces the coefficients of data points that are more similar to the target point to stay away from zero and the coefficients of data points that are less similar to come close to zero; we then remove the coefficients with small absolute values, obtaining a better coefficient matrix that improves accuracy in the spectral clustering stage. Experiments on Hopkins155, a standard subspace clustering benchmark, show that DLS3C reduces bad coefficients and improves clustering accuracy. A sketch of this pipeline is given below.
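The following is a rough, hypothetical sketch of the "shrink, denucleate, cluster" pipeline. It uses a plain SSC-style self-representation with an elastic-net penalty, whose l2 part plays the role of the Frobenius norm term, as a stand-in for the modified LS3C/NLS3C objective (which additionally learns a latent space). The function name, the threshold value, and all solver settings are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cluster import SpectralClustering

def denucleated_subspace_clustering(X, n_clusters, threshold=1e-3):
    """X: data matrix of shape (n_features, n_samples). Returns cluster labels."""
    n = X.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        # Represent column i by the other columns; the l1 part gives sparsity,
        # the l2 part is the Frobenius-type term that keeps the coefficients
        # of similar points away from zero.
        mask = np.arange(n) != i
        model = ElasticNet(alpha=0.1, l1_ratio=0.7,
                           fit_intercept=False, max_iter=5000)
        model.fit(X[:, mask], X[:, i])
        C[mask, i] = model.coef_
    # "Denucleation": drop coefficients with small absolute value
    # (the bad coefficients, in the thesis's terminology).
    C[np.abs(C) < threshold] = 0.0
    # Symmetric affinity from coefficients, then spectral clustering.
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels

Zeroing out the small-magnitude coefficients before building the affinity matrix is the denucleation step: it suppresses the bad similarities between points from different subspaces before spectral clustering runs.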
Keywords: Feature selection, Mutual information, PageRank, Node importance estimation, High-dimensional data, Subspace clustering, Sparse representation, Spectral clustering, Motion segmentation