
Research On Feature Selection And Its Stability For High-dimensional Data

Posted on: 2013-07-02    Degree: Master    Type: Thesis
Country: China    Candidate: J Bao    Full Text: PDF
GTID: 2248330395952890    Subject: Computer application technology
Abstract/Summary:
Feature selection is one of the key problems in machine learning and pattern recognition. As research in these fields deepens and the objects of learning become more complex, the feature dimensionality of those objects keeps growing. A high-dimensional data set contains hundreds or thousands of features, among them a large amount of irrelevant and redundant information that can greatly degrade the performance of learning algorithms. Feature selection is therefore particularly important when dealing with high-dimensional data. Researchers at home and abroad have done extensive work on this problem, and the results have been widely applied to text classification, risk management, Web classification, medical diagnosis, biological data analysis, genome projects, and so on.

Existing studies often focus on the classification performance of feature selection while neglecting its stability, i.e., the insensitivity of the selection result to small changes in the training samples. Stability is especially important when the goal is to discover the true variables of an underlying natural model, as in biomarker identification. To improve stability, researchers have so far proposed ensemble feature selection, feature selection with prior feature relevance, group feature selection, feature selection with sample injection, and other methods. In this thesis, we study feature selection methods and their stability on high-dimensional data, focusing on three questions: how to obtain a feature subset, how to evaluate a feature subset, and how to make the selection results stable. Drawing on existing research, we improve the original feature selection methods and feature-subset evaluation criteria by means of the 1-norm SVM and ensemble learning.
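The thesis designs its own stability measure, which is not specified in this abstract. As a point of reference only, a common way to quantify selection stability is the average pairwise Jaccard similarity between the feature subsets chosen on perturbed training sets; the sketch below assumes that definition and is not the thesis's measure.

```python
def subset_stability(subsets):
    """Average pairwise Jaccard similarity between feature subsets.

    subsets: list of sets of selected feature indices, one per run
             on a perturbed training sample.
    Returns a value in [0, 1]; 1 means every run selected the same
    features, values near 0 mean the selections barely overlap.
    """
    pairs, total = 0, 0.0
    for i in range(len(subsets)):
        for j in range(i + 1, len(subsets)):
            inter = len(subsets[i] & subsets[j])
            union = len(subsets[i] | subsets[j])
            # Two empty subsets are treated as identical.
            total += inter / union if union else 1.0
            pairs += 1
    return total / pairs if pairs else 1.0
```

For example, two identical subsets give a score of 1.0, while two disjoint subsets give 0.0.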
The main work is as follows:
1. We propose SFS-tSVM, an algorithm based on an improved SVM-based evaluation criterion. We improve the existing criterion by adding a threshold, combine it with a feature selection method, and ensemble the selection results by data perturbation. We also design a new stability measure for high-dimensional data. Experiments show that SFS-tSVM effectively improves the stability of feature selection.
2. We propose MFS-tSVM, an ensemble algorithm based on multiple feature selectors and the improved SVM-based evaluation criterion. The results of the multiple feature selectors are combined by function perturbation. Experiments show that this algorithm effectively improves the stability of feature selection while achieving good classification accuracy.
3. We propose L1SVM-EFS, an ensemble algorithm based on the L1-norm SVM. The sparse SVM is applied to feature selection on high-dimensional data and combined with an ensemble method based on data perturbation. Experiments show that this algorithm effectively improves the stability of the selection results without decreasing classification accuracy.
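The abstract does not give the details of L1SVM-EFS, but the combination it describes (a sparse L1-penalised linear SVM plus ensembling by data perturbation) can be illustrated in scikit-learn. The sketch below is an assumed minimal version: each round fits an L1 `LinearSVC` on a bootstrap sample and the features are ranked by their average absolute weight across rounds; the function name, `n_rounds`, `top_k`, and `C` are illustrative choices, not the thesis's parameters.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l1svm_ensemble_select(X, y, n_rounds=20, top_k=10, C=0.1, seed=0):
    """Ensemble feature selection with an L1-penalised linear SVM.

    Each round draws a bootstrap sample (data perturbation), fits the
    sparse SVM, and accumulates the absolute coefficient of every
    feature.  Averaging over rounds smooths out the instability of a
    single L1 fit; the top_k highest-scoring features are returned.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000)
        clf.fit(X[idx], y[idx])
        scores += np.abs(clf.coef_).sum(axis=0)       # per-feature weight
    scores /= n_rounds
    return np.argsort(scores)[::-1][:top_k]           # top-ranked features
```

Averaging weights over bootstrap rounds is one of several plausible aggregation rules; counting how often a feature receives a non-zero weight would be an equally reasonable alternative.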
Keywords/Search Tags:high-dimensional data, feature selection, stability, ensemble learning, L1SVM