
Classification and Feature Selection on High-Dimensional and Small-Sample Data

Posted on: 2015-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 1268330428974532
Subject: Computer application technology
Abstract/Summary:
Data sets with high dimensions and small samples are very common in practical applications, such as text data in natural language processing, image data in computer vision, and gene expression profiles in bioinformatics. Such data pose a challenge for existing learning algorithms. As data dimensionality increases rapidly, these data contain a great deal of irrelevant and redundant information, which may greatly deteriorate the performance of machine learning algorithms, increase the computational complexity, and lead to the problems of the "Curse of Dimensionality" and "Over-Fitting". Feature selection is an effective way to address the problem of high dimensions and small samples, because it can remove a large number of irrelevant and redundant features and find a compact feature subset with high classification accuracy. It is therefore significant in both research and application.

In this dissertation, we use gene expression profiles as the experimental data. Feature selection algorithms are applied to the disease classification problem, with classification accuracy serving as one of the evaluation indicators of these algorithms. Our work focuses on feature selection for high-dimensional, small-sample data, and our main contributions are as follows:

1) Since high-dimensional, small-sample data can lead to the "Curse of Dimensionality", we propose an embedded feature selection algorithm called K-split Lasso. It aims to reduce the data dimensionality to improve classification accuracy and to address the problem of high computational complexity. K-split Lasso first divides the feature set into K parts, then selects features from each part using Lasso, and finally merges the selected genes into one feature subset.
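The split-select-merge procedure of K-split Lasso can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name, the value of K, the Lasso regularization strength `alpha`, and the zero-coefficient threshold are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def k_split_lasso(X, y, k=4, alpha=0.1):
    """Split the features into k parts, run Lasso on each part,
    and merge the indices of features with non-zero coefficients."""
    n_features = X.shape[1]
    splits = np.array_split(np.arange(n_features), k)
    selected = []
    for idx in splits:
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[:, idx], y)                      # fit Lasso on this part only
        selected.extend(idx[np.abs(model.coef_) > 1e-8])
    return np.array(sorted(selected))

# toy high-dimensional, small-sample data: 20 samples, 100 features,
# only features 0 and 50 carry signal
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
y = 3.0 * X[:, 0] - 2.0 * X[:, 50] + 0.01 * rng.standard_normal(20)
print(k_split_lasso(X, y, k=4))
```

Because each Lasso problem sees only one part of the features, the per-part problems are much smaller than a single Lasso over all features, which is the source of the computational saving the text refers to.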
Our experimental results demonstrate that K-split Lasso can improve the prediction accuracy of classification models and, to some extent, alleviate the "Curse of Dimensionality".

2) Since high-dimensional, small-sample data can lead to "Over-Fitting", we present a new hybrid feature selection algorithm called GSIL. It aims to select a small set of important features that are more relevant to the classification task. In our approach, we first apply the feature-ranking algorithm Signal-to-Noise Ratio to filter out irrelevant features, and then apply Iterative Lasso to eliminate redundant features. Empirical studies demonstrate that our approach can reduce data redundancy to improve classification accuracy and alleviate "Over-Fitting". Moreover, the effectiveness of GSIL is verified by comparison with several well-known feature selection methods.

3) Since high-dimensional, small-sample data can lead to instability of feature selection algorithms, we use an ensemble learning technique to improve the prediction accuracy of classification models and the stability of feature selection. Most existing feature selection methods choose only an individual small feature subset according to its discriminative power. Although such methods can improve the performance of learning models, the selected subset is prone to instability because of its relatively limited amount of information. Thus, we present an ensemble correlation-based feature selection approach, ECGS-RG. It aims to generate a variety of effective feature subsets to make up for the insufficiency of an individual feature subset. ECGS-RG applies information metrics and the approximate Markov blanket technique to evaluate the correlation between a candidate feature and the selected subset. Experimental results show that, in most cases, the classification performance and stability of ECGS-RG are superior to those of algorithms that select only an individual feature subset.
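The Signal-to-Noise Ratio ranking used as GSIL's filter stage can be sketched as below for a two-class problem. The function name and the small epsilon guarding against zero variance are illustrative assumptions, and the subsequent Iterative Lasso stage is omitted here.

```python
import numpy as np

def snr_rank(X, y):
    """Rank features by |SNR| = |mu0 - mu1| / (sigma0 + sigma1), descending.

    X: (n_samples, n_features) expression matrix; y: binary labels {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    snr = (X0.mean(axis=0) - X1.mean(axis=0)) / (
        X0.std(axis=0) + X1.std(axis=0) + 1e-12  # epsilon avoids division by zero
    )
    return np.argsort(-np.abs(snr))  # most discriminative features first

# toy data: feature 2 separates the two classes, the rest are noise
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = np.array([0] * 15 + [1] * 15)
X[y == 1, 2] += 5.0  # shift feature 2 for class 1
print(snr_rank(X, y)[0])  # feature 2 ranks first
```

In a filter-then-wrapper hybrid like GSIL, one would keep only the top-ranked features from this step and pass them to the Lasso stage, so the expensive iterative selection runs on a much smaller candidate set.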
Keywords/Search Tags:Feature Selection, High-Dimensionality and Small-Sample, Classification, Lasso, Ensemble Learning