
Rough Matroid And Its Application To Dimensionality Reduction In High-dimensional Data

Posted on: 2015-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S P Wang
GTID: 1108330473956016
Subject: Computer software and theory

Abstract/Summary:
Almost all industries and application areas now generate diverse large-scale unstructured or semi-structured data, which requires us to exploit the underlying principles of, and extract rules from, massive amounts of data. Big data have four remarkable features, namely volume, variety, value and velocity, which implies that existing technical frameworks and approaches cannot deal with such data efficiently. This also poses a new challenge for making full use of the data. Dimensionality reduction is therefore important: it not only offers a route around the 'curse of dimensionality', eases the so-called 'data rich but knowledge poor' problem and decreases the computational complexity in time and space, but also provides a new view for understanding and representing the relationships among features.

From the viewpoint of different data types, this dissertation systematically studies feature selection problems and their specialized models for discrete and continuous data. Specifically, we provide an evaluation measure for greedy algorithms, which is important for the development of efficient greedy algorithms. Kernel functions, sparse representation and nonnegative matrix factorization techniques are incorporated into one joint framework for feature selection, and several efficient feature selection algorithms are proposed. The main innovative results involve the following four aspects:

(1) Introduce submodular functions to evaluate greedy feature selection algorithms. Many efficient feature selection algorithms rely on the smoothness of continuous objective functions, but this fails for discrete functions, which is why almost all feature selection algorithms for discrete data are built on greedy strategies. As is well known, the solutions found by greedy algorithms are usually not globally optimal, only satisfactory. Measuring the gap between these two solutions is an important but hard problem. We construct the matroidal structures of rough sets and evaluate, via submodular functions, how closely the satisfactory solution approximates the optimal one (a minimal sketch of the underlying guarantee appears after part (2) below).

(2) Propose linear structure preserving feature selection. Measuring the linearity of a given dataset is hard, and the problem is especially important when the relationships among features are well described linearly. We represent the coherent relationships among features by the linear coefficient matrix of the corresponding sparse coding. Because the L1-norm is employed to adjust the sparsity of the coefficients in the sparse representation, the objective functions are non-smooth. By imposing an upper bound on the norm of the coefficient matrix, this type of non-smooth optimization problem is transformed into a smooth one. Specifically, neighborhoods are used to describe the locally linear structure of the data, sparse coding and feature selection are incorporated into one joint framework, and a neighborhood embedding feature selection algorithm is proposed (a toy sparse-coding sketch also follows below).
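The evaluation idea in contribution (1) connects to the classical Nemhauser-Wolsey-Fisher result: greedy maximization of a monotone submodular function under a cardinality constraint attains at least a (1 - 1/e) fraction of the optimum. Below is a minimal sketch in Python, assuming a toy coverage objective and random data for illustration, not the dissertation's rough-set-based submodular functions:

```python
import itertools
import numpy as np

# Toy monotone submodular function: coverage. Each "feature" covers a
# subset of sample indices; f(S) = number of samples covered by S.
np.random.seed(0)
n_features, n_samples = 8, 30
covers = [set(np.random.choice(n_samples, size=6, replace=False))
          for _ in range(n_features)]

def f(S):
    """Coverage value of a feature subset S (monotone and submodular)."""
    return len(set().union(*(covers[i] for i in S))) if S else 0

def greedy(k):
    """Standard greedy: repeatedly add the feature with largest marginal gain."""
    S = []
    for _ in range(k):
        gains = [(f(S + [i]) - f(S), i) for i in range(n_features) if i not in S]
        S.append(max(gains)[1])
    return S

k = 3
S_greedy = greedy(k)
opt = max(f(list(S)) for S in itertools.combinations(range(n_features), k))
# Nemhauser-Wolsey-Fisher bound: f(S_greedy) >= (1 - 1/e) * f(S_opt).
print(f(S_greedy), opt, (1 - 1 / np.e) * opt)
assert f(S_greedy) >= (1 - 1 / np.e) * opt
```

On this toy instance the final assertion always holds; the dissertation's contribution lies in constructing such submodular evaluators from the matroidal structures of rough sets, which the sketch does not reproduce.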
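As a toy illustration of the sparse self-representation idea behind contribution (2), the sketch below codes each feature on the remaining features with an L1 penalty, solved by plain iterative soft-thresholding, and ranks features by the row norms of the coefficient matrix. The ista_lasso solver, the penalty weight and the scoring rule are illustrative assumptions; the dissertation's neighborhood embedding algorithm and its smooth reformulation via a norm bound are not reproduced here:

```python
import numpy as np

def ista_lasso(A, b, lam, n_iter=500):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)              # gradient of the smooth part
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

np.random.seed(0)
X = np.random.randn(100, 10)               # samples x features
X[:, 3] = 2 * X[:, 0] - X[:, 1]            # feature 3 is a linear combination

# Self-representation: sparsely code each feature on the remaining features.
W = np.zeros((10, 10))
for j in range(10):
    others = [i for i in range(10) if i != j]
    W[others, j] = ista_lasso(X[:, others], X[:, j], lam=5.0)

# Row norms of W score how strongly each feature participates in the
# linear structure of the others, giving one possible selection ranking.
scores = np.linalg.norm(W, axis=1)
print(np.argsort(-scores))
```

The L1 term is exactly what makes the objective non-smooth; soft-thresholding handles it iteratively here, whereas the dissertation instead bounds the coefficient-matrix norm to obtain a smooth problem.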
(3) Incorporate kernel tricks into feature selection. Kernel functions are important for nonlinear data; their basic principle is to map the original feature space into an implicit, high-dimensional (even infinite-dimensional) vector space via a nonlinear mapping. Note that in this process the coordinates of the original data in the mapped space are unavailable, whereas the inner products between features remain available. Because the coordinates in the high-dimensional space are unknown, kernel tricks cannot be applied directly in many feature selection techniques. Drawing support from projection matrices, feature selection problems are expressed in the form of matrix factorization, and kernel techniques are then incorporated into the feature selection process (a kernel-based sketch appears after part (4) below).

(4) Present an approximation algorithm for higher-order matrix factorization. High dimensionality is a significant characteristic of big data, and matrix factorization is a popular technique for dimensionality reduction, data compression and classification. Many real-world applications can be formalized as higher-order matrix factorization problems, such as clustering and orthogonality-constrained optimization problems; however, almost all existing results on matrix factorization are limited to second-order optimization problems. Here, feature selection problems are represented as fourth-order matrix factorization problems. Using penalized matrices, an approximation algorithm for fourth-order matrix factorization is proposed and its convergence is proved (a penalty-method sketch follows below).
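Returning to contribution (3): one standard way to select features while touching only kernel evaluations, never explicit coordinates in the implicit space, is greedy kernel-target alignment. The sketch below uses it as a stand-in, since the abstract does not spell out the projection-matrix factorization; the RBF kernel, the alignment score and the toy labels are all illustrative assumptions:

```python
import numpy as np

def rbf(X, gamma=0.5):
    """RBF Gram matrix over samples; needs only inner products/distances,
    never coordinates in the implicit high-dimensional space."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def alignment(K1, K2):
    """Kernel alignment between two Gram matrices (cosine of Frobenius inner product)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

np.random.seed(0)
X = np.random.randn(60, 8)
y = np.sign(X[:, 0] + 0.5 * X[:, 2])       # target depends on features 0 and 2
K_target = np.outer(y, y)                  # ideal kernel induced by the labels

# Greedy: add the feature whose induced kernel best aligns with the target.
selected = []
for _ in range(2):
    best = max((i for i in range(8) if i not in selected),
               key=lambda i: alignment(rbf(X[:, selected + [i]]), K_target))
    selected.append(best)
print(selected)   # expected to recover features 0 and 2 on this toy data
```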
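For contribution (4), the sketch below shows the penalty idea on a simpler second-order instance: an orthogonality-constrained factorization min ||X - WH||_F^2 subject to W^T W = I, where the constraint is folded into the objective as a quadratic penalty and the resulting smooth problem is solved by gradient descent. The penalty weight, step size and iteration count are illustrative assumptions; the dissertation's fourth-order formulation and convergence proof are not reproduced:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 5) @ np.random.randn(5, 15)   # rank-5 data matrix

m, n, r = X.shape[0], X.shape[1], 5
W = np.random.randn(m, r) * 0.1
H = np.random.randn(r, n) * 0.1
mu, lr = 10.0, 1e-3                                   # penalty weight, step size

# Penalty method: fold the constraint W^T W = I into the objective as
#   ||X - W H||_F^2 + mu * ||W^T W - I||_F^2
# and run plain gradient descent on the smooth penalized problem.
for _ in range(5000):
    R = W @ H - X                                     # reconstruction residual
    P = W.T @ W - np.eye(r)                           # constraint violation
    grad_W = 2 * R @ H.T + 4 * mu * W @ P
    grad_H = 2 * W.T @ R
    W -= lr * grad_W
    H -= lr * grad_H

# Both the fit error and the orthogonality violation should be small.
print(np.linalg.norm(X - W @ H), np.linalg.norm(W.T @ W - np.eye(r)))
```

The same device of trading a hard constraint for a penalized smooth term is what extends to the higher-order setting; only the second-order case is shown here.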
In conclusion, this dissertation starts with feature selection algorithms for discrete data, presents an evaluation metric for greedy algorithms via submodular functions, embeds kernel tricks into feature selection problems, proposes a measure for linear structure, and provides an algorithm for higher-order matrix factorization. These results further enrich the theory of feature selection techniques and point the way to future work.

Keywords/Search Tags: machine learning, feature selection, sparse representation, matroid, rough set