
Classification and Feature Selection on High-Dimensional and Small-Sample Data

Posted on: 2015-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 1268330428974532
Subject: Computer application technology
Abstract/Summary:
Data sets with high dimensions and small samples are very common in practical applications, such as text data in natural language processing, image data in computer vision, and gene expression profiles in bioinformatics. Such data pose a challenge for existing learning algorithms. As data dimensionality increases rapidly, these data contain a great deal of irrelevant and redundant information, which may greatly deteriorate the performance of machine learning algorithms, increase the computational complexity, and lead to the problems of the "Curse of Dimensionality" and "Over-Fitting". Feature selection is an effective way to address the problem of high dimensions and small samples, because it can remove a large number of irrelevant and redundant features and find a compact feature subset with high classification accuracy. It is therefore significant in both research and application.

In this dissertation, we use gene expression profiles as the experimental data. Feature selection algorithms are applied to the disease classification problem, with classification accuracy serving as one of the evaluation indicators of these algorithms. Our work focuses on feature selection for high-dimensional, small-sample data, and our main contributions are as follows:

1) Since high-dimensional, small-sample data can lead to the "Curse of Dimensionality", we propose an embedded feature selection algorithm called K-split Lasso. It aims to reduce the data dimensionality to improve classification accuracy and to address the problem of high computational complexity. K-split Lasso first divides the feature set into K parts, then selects features from each part using Lasso, and finally merges the selected genes into one feature subset.
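The split-select-merge procedure of K-split Lasso can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name, the value of K, the Lasso regularization strength `alpha`, and the zero-coefficient threshold are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def k_split_lasso(X, y, k=4, alpha=0.1):
    """Split the features into k parts, run Lasso on each part,
    and merge the indices of features with non-zero coefficients."""
    n_features = X.shape[1]
    splits = np.array_split(np.arange(n_features), k)
    selected = []
    for idx in splits:
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[:, idx], y)                      # fit Lasso on this part only
        selected.extend(idx[np.abs(model.coef_) > 1e-8])
    return np.array(sorted(selected))

# toy high-dimensional, small-sample data: 20 samples, 100 features,
# only features 0 and 50 carry signal
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
y = 3.0 * X[:, 0] - 2.0 * X[:, 50] + 0.01 * rng.standard_normal(20)
print(k_split_lasso(X, y, k=4))
```

Because each Lasso problem sees only one part of the features, the per-part problems are much smaller than a single Lasso over all features, which is the source of the computational saving the text refers to.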
Our experimental results demonstrate that K-split Lasso can improve the prediction accuracy of classification models and, to some extent, alleviate the "Curse of Dimensionality".

2) Since high-dimensional, small-sample data can lead to "Over-Fitting", we present a new hybrid feature selection algorithm called GSIL. It aims to select a small set of important features that are more relevant to the classification task. In our approach, we first apply the feature-ranking algorithm Signal-to-Noise Ratio to filter out irrelevant features, and then apply Iterative Lasso to eliminate redundant features. Empirical studies demonstrate that our approach can reduce data redundancy to improve classification accuracy and alleviate "Over-Fitting". Moreover, the effectiveness of GSIL is verified by comparison with several well-known feature selection methods.

3) Since high-dimensional, small-sample data can lead to instability of feature selection algorithms, we use an ensemble learning technique to improve the prediction accuracy of classification models and the stability of feature selection. Most existing feature selection methods choose only an individual small feature subset according to its discriminative power. Although such methods can improve the performance of learning models, the selected subset is prone to instability because of its relatively limited amount of information. Thus, we present an ensemble correlation-based feature selection approach, ECGS-RG. It aims to generate a variety of effective feature subsets to make up for the insufficiency of an individual feature subset. ECGS-RG applies information metrics and the approximate Markov blanket technique to evaluate the correlation between a candidate feature and the selected subset. Experimental results show that, in most cases, the classification performance and stability of ECGS-RG are superior to those of algorithms that select only an individual feature subset.
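The Signal-to-Noise Ratio ranking used as GSIL's filter stage can be sketched as below for a two-class problem. The function name and the small epsilon guarding against zero variance are illustrative assumptions, and the subsequent Iterative Lasso stage is omitted here.

```python
import numpy as np

def snr_rank(X, y):
    """Rank features by |SNR| = |mu0 - mu1| / (sigma0 + sigma1), descending.

    X: (n_samples, n_features) expression matrix; y: binary labels {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    snr = (X0.mean(axis=0) - X1.mean(axis=0)) / (
        X0.std(axis=0) + X1.std(axis=0) + 1e-12  # epsilon avoids division by zero
    )
    return np.argsort(-np.abs(snr))  # most discriminative features first

# toy data: feature 2 separates the two classes, the rest are noise
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = np.array([0] * 15 + [1] * 15)
X[y == 1, 2] += 5.0  # shift feature 2 for class 1
print(snr_rank(X, y)[0])  # feature 2 ranks first
```

In a filter-then-wrapper hybrid like GSIL, one would keep only the top-ranked features from this step and pass them to the Lasso stage, so the expensive iterative selection runs on a much smaller candidate set.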
Keywords/Search Tags:Feature Selection, High-Dimensionality and Small-Sample, Classification, Lasso, Ensemble Learning