Research On Feature Selection And Stability Analysis For High Dimensionality Small Sample Size Data

Posted on: 2015-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Ning
GTID: 2268330428961562
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of bioinformatics, image processing, text mining, and other large-scale data mining applications, data mining research has become more complex. Real-world applications and scientific research generate large amounts of high dimensionality small sample size data, and mining such data directly is prone to the curse of dimensionality. Feature selection reduces the dimensionality of such data by removing redundant and noisy features, which improves classification accuracy, reduces algorithmic complexity, and avoids the curse of dimensionality.

Existing feature selection methods focus primarily on classification and clustering performance and ignore the stability of feature selection. Stability of feature selection is the insensitivity of a feature selection algorithm's result to variations in the training set. Stability matters for data mining and machine learning on high dimensionality small sample size data: unstable feature selection results introduce ambiguity and make it difficult to obtain an interpretable feature subset. This thesis studies feature selection and its stability for high dimensionality small sample size data. The main contributions are summarized as follows:

1. We review feature selection models and the existing methods that aim to stabilize feature selection results. We also systematically review the approaches and measurements used to evaluate the stability of a feature selection method.

2. We propose RF-RCE (Random Forests Recursive Cluster Elimination), a feature selection method for high dimensionality small sample size data. RF-RCE builds on SVM-RCE and ISVM-RCE and uses random forest variable importance to score features. Because random forests handle high dimensionality small sample size data well, RF-RCE is far more computationally efficient than ISVM-RCE while achieving comparable classification accuracy and stability. RF-RCE can also handle ultrahigh-dimensional data that ISVM-RCE cannot.

3. To improve the stability of feature selection, we systematically collate and analyze the causes of its instability. We also introduce a new stability metric that takes into account both the feature subset and the feature ranking. Moreover, we propose REFS (Random Ensemble Feature Selection), a stable feature selection method based on random forests. Experiments on many high dimensionality small sample size datasets verify the effectiveness of the proposed method.
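
The notion of stability defined in the abstract (insensitivity of the selected features to changes in the training set) can be illustrated with a small sketch. The abstract does not specify the thesis's own metric, so the code below swaps in a common subset-based measure, the mean pairwise Jaccard similarity, as an assumption. The function names and the feature labels `g1` to `g4` are hypothetical, and the measure ignores feature ranking, which the thesis's metric is said to include.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def subset_stability(subsets):
    """Mean pairwise Jaccard similarity over all selected feature subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected from three resampled training sets.
runs = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3"},
]
score = subset_stability(runs)
print(round(score, 3))
```

A score of 1.0 means every run selected the same features; values near 0 mean the selected subsets barely overlap. In practice the subsets would come from running the same selector on bootstrap or subsampled versions of the training data.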
Keywords/Search Tags: high dimensionality small sample size, feature selection, stability, random forests