
MDL-Based Feature Selection for High-Dimensional Data

Posted on: 2011-10-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 1118360308465874
Subject: Information security
Abstract
Feature selection for high-dimensional data, also known as the sparse modeling problem, is one of the hot issues in current machine learning research; it addresses the fact that most existing methods fail to produce meaningful results in high-dimensional feature spaces. Regularization is the most popular methodology in current studies, especially L1-penalized and L0-penalized methods. However, the prevalent L1-norm regularization approaches share some inherent theoretical drawbacks: they lack the ability to select grouped features and cannot select more features than the sample size. Moreover, most traditional L0-penalized methods are prone to overfitting in data-sparse environments, mainly because they place no constraint on model complexity.

Recent research has revealed that stepwise regression with L0 regularization can perform better than L1 algorithms in sparse cases. We therefore proposed three novel L0-penalized feature selection approaches for high-dimensional data, all derived from the minimum description length (MDL) principle:

1. An easily computable feature evaluation criterion, derived from the Fisher information approximation to the stochastic complexity criterion and simplified by imposing a Gaussian assumption on the parametric model. On top of this simplified criterion we built a novel stepwise regression approach for sparse modeling (a minimal sketch of this scheme appears after the abstract). Numerical results on synthetic and public microarray datasets show that the proposed approach outperforms the prevalent L1-penalized methods and other state-of-the-art alternatives.

2. A biased risk inflation criterion for feature selection, which improves the generality of the RIC approach by combining the risk inflation criterion with an L2 penalty on the model parameters. Based on this evaluation criterion we proposed a second stepwise regression approach for choosing between nested parametric models (see the second sketch below). Empirical studies demonstrated its superiority over the popular alternatives tested above.

3. A third approach intended to combine the strengths of the two approaches above and thereby provide a more general solution to the sparse modeling problem. By applying another Tikhonov-type penalty to the parametric model in combination with the stochastic complexity criterion, we constructed a novel biased-MDL feature selection method (see the third sketch below), which was shown to mitigate the theoretical risks and limitations of pure MDL approaches. Experimental results on synthetic datasets, real microarray datasets, and real image-spam datasets show that the method handles a wide range of sparse modeling tasks efficiently, outperforms the prevalent L1-penalized methods and other alternatives in all tests, and improves on the two approaches above.

The theoretical studies and the corresponding experimental results provide new and strong evidence for the claim that stepwise regression with L0 regularization can perform better than L1 algorithms in sparse cases. It is the author's hope that the findings and conclusions presented here will offer a new perspective and encourage further investigations along the lines suggested.
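The abstract does not reproduce the thesis's derivations, so the following Python sketch only illustrates the flavor of the first approach under standard assumptions: a greedy forward stepwise search scored by the familiar BIC-style Fisher-information approximation to stochastic complexity for a Gaussian linear model, (n/2)·log(RSS/n) + (k/2)·log(n). The function names and the exact form of the penalty are illustrative, not the thesis's calibrated criterion.

```python
# Forward stepwise feature selection scored by a Gaussian two-part MDL
# code length -- a minimal sketch, NOT the thesis's exact criterion.
import numpy as np

def mdl_score(X_sub, y):
    """Two-part code length for a Gaussian linear model on the selected
    columns: an (n/2)*log(RSS/n) data-fit cost plus a (k/2)*log(n)
    parameter cost (BIC-style approximation to stochastic complexity)."""
    n, k = X_sub.shape
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss = np.sum((y - X_sub @ beta) ** 2)
    return 0.5 * n * np.log(max(rss, 1e-12) / n) + 0.5 * k * np.log(n)

def stepwise_mdl(X, y):
    """Greedily add the feature that most reduces the code length;
    stop when no addition improves it (an L0-style subset search)."""
    n, p = X.shape
    selected, best = [], np.inf
    while True:
        candidates = [(mdl_score(X[:, selected + [j]], y), j)
                      for j in range(p) if j not in selected]
        if not candidates:
            break
        score, j = min(candidates)
        if score >= best:          # stopping rule: code length no longer shrinks
            break
        best, selected = score, selected + [j]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 200))   # n << p, the sparse regime
    y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=50)
    print(stepwise_mdl(X, y))        # ideally recovers [0, 1, 2]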
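The second approach can be sketched as a drop-in replacement for mdl_score in the loop above. The risk inflation criterion of Foster and George charges roughly 2·log(p) per selected coefficient; the "bias" is rendered here as a ridge (L2) fit of the candidate subset. The parameters lam and sigma2 are hypothetical placeholders, not values derived in the thesis.

```python
# Biased RIC score -- a sketch under assumptions: ridge-regularized fit
# (the L2 "bias") plus an RIC-style penalty of 2*log(p) per feature.
import numpy as np

def biased_ric_score(X_sub, y, p, lam=1.0, sigma2=1.0):
    """RSS/sigma^2 + 2*k*log(p) with a ridge-regularized estimate.
    lam (ridge weight) and sigma2 (noise level) are illustrative."""
    n, k = X_sub.shape
    # Ridge estimate: (X'X + lam*I)^{-1} X'y
    beta = np.linalg.solve(X_sub.T @ X_sub + lam * np.eye(k), X_sub.T @ y)
    rss = np.sum((y - X_sub @ beta) ** 2)
    return rss / sigma2 + 2.0 * k * np.log(p)
```

Because the per-feature charge grows with log(p) rather than log(n), this score is far more conservative than the MDL/BIC-style cost when p greatly exceeds n, which matches the abstract's focus on sparse, high-dimensional regimes.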
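Finally, a sketch of the third, combined scoring rule under the same assumptions: the Gaussian code-length term plus a Tikhonov (ridge) penalty on the fitted parameters. The weighting lam is again an illustrative placeholder; the thesis's actual biased-MDL criterion may weight or derive these terms differently.

```python
# Biased-MDL score -- a sketch combining the two ideas above: the
# Gaussian two-part code length plus a Tikhonov penalty on beta.
import numpy as np

def biased_mdl_score(X_sub, y, lam=1.0):
    """Penalized code length: (n/2)*log(RSS/n) + (k/2)*log(n)
    evaluated at a ridge fit, plus lam*||beta||^2 (illustrative)."""
    n, k = X_sub.shape
    beta = np.linalg.solve(X_sub.T @ X_sub + lam * np.eye(k), X_sub.T @ y)
    rss = np.sum((y - X_sub @ beta) ** 2)
    code_len = 0.5 * n * np.log(max(rss, 1e-12) / n) + 0.5 * k * np.log(n)
    return code_len + lam * np.sum(beta ** 2)
```

Any of the three scores can drive the same stepwise_mdl search loop, which reflects the abstract's framing of the three contributions as different evaluation criteria over a common L0-style stepwise procedure.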
Keywords: machine learning, model selection, feature selection, sparse modeling, regularization, L0 norm, high-dimensional data