
Research On Feature Selection And Semi-Supervised Classification

Posted on: 2012-06-04    Degree: Doctor    Type: Dissertation
Country: China    Candidate: D S Huang    Full Text: PDF
GTID: 1488303335950809    Subject: Computer application technology
Abstract/Summary:
Feature selection and semi-supervised classification are effective methods for alleviating the problem of low sample size in high-dimensional space. They have produced fruitful research results in statistics, data mining, machine learning, pattern recognition, bioinformatics, and other fields, and in data mining and machine learning in particular they have become a current research focus of both theoretical and practical value.

Several problems remain open: (1) most feature selection methods rank all features according to some evaluation score and then construct the final feature subset from the top-ranked features subject to an empirical threshold; although each selected feature has better discrimination capability than any unselected feature, a subset constructed this way may not possess the best discrimination capability as a whole; (2) the standard co-training algorithm requires two sufficiently redundant views, i.e., features that naturally divide into two independent sets; this assumption fails in most practical learning scenarios, especially with low sample size in high-dimensional space; (3) existing semi-supervised classification methods that assemble multiple base classification algorithms aim to improve the performance of weak base learners in low-sample settings, yet they require base classifiers of high accuracy, which is difficult to satisfy in exactly those settings, so their performance is usually poor; (4) many researchers currently combine semi-supervised learning with manifold learning, trying to exploit the rich geometric and structural information of unlabeled data to design high-accuracy classification algorithms, but these semi-supervised methods are generally complex and their parameter adjustment is complicated.

To address these problems of feature selection and semi-supervised classification under low sample size in high-dimensional space, this dissertation makes the following main contributions.

Firstly, a new feature subset evaluation method is proposed that estimates the discrimination capability of candidate features in a low-dimensional feature subspace. By combining the new evaluation method with a best-first search strategy, FSCRF, a new filter method for feature subset selection, is developed. Experimental results demonstrate that the method not only selects a feature subset with fewer features for most data sets but also yields good classification performance in most cases; FSCRF likewise achieves satisfactory results on real Alzheimer's disease data.

Secondly, the deficiency of standard co-training is discussed and a new cross-training based learning algorithm, NC-T, is proposed. The raw data, containing both labeled and unlabeled samples, are randomly divided into three subsets of similar size, and three classifiers are generated from them. NC-T needs neither sufficient and redundant views nor different supervised learning methods. Compared with co-training, each base classifier of NC-T is trained on 2/3 of the labeled samples instead of 1/2. Experimental results show that NC-T improves classification accuracy in most cases. A minimal sketch of this cross-training idea appears below.
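The following is a minimal sketch of the cross-training idea behind NC-T, assuming scikit-learn-style estimators; the function name, the agreement-based pseudo-labelling rule, and all parameters are illustrative assumptions, not the dissertation's exact procedure.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def nc_t_sketch(X_lab, y_lab, X_unlab, base=None, rounds=5, seed=0):
    """Hypothetical cross-training loop: three classifiers, each trained
    on 2/3 of the labeled data, teach one another via agreed pseudo-labels."""
    base = base if base is not None else DecisionTreeClassifier()
    rng = np.random.default_rng(seed)
    # Randomly split the labeled pool into three similar-sized folds.
    folds = np.array_split(rng.permutation(len(X_lab)), 3)
    # Classifier i trains on the two folds it does not hold out (2/3 of labels).
    train_idx = [np.concatenate([folds[j] for j in range(3) if j != i])
                 for i in range(3)]
    clfs = [clone(base).fit(X_lab[idx], y_lab[idx]) for idx in train_idx]
    for _ in range(rounds):
        preds = [clf.predict(X_unlab) for clf in clfs]
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Pseudo-label the unlabeled points on which the other two agree.
            agree = preds[j] == preds[k]
            Xi = np.vstack([X_lab[train_idx[i]], X_unlab[agree]])
            yi = np.concatenate([y_lab[train_idx[i]], preds[j][agree]])
            clfs[i] = clone(base).fit(Xi, yi)
    return clfs  # predict by majority vote over the three classifiers
```

At prediction time the three classifiers vote; no redundant feature views are needed because diversity comes from the random three-way split of the samples rather than from a split of the feature set.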
Thirdly, existing semi-supervised ensemble classification methods require base classifiers of high accuracy. To reduce this requirement on the base classifiers, SSMAB, a semi-supervised method built on multi-class AdaBoost, is proposed; it achieves high classification accuracy by exploiting the unlabeled data, and the classification accuracy required of each base classifier is only 1/K, where K is the number of classes.

Fourthly, since a non-metric measure can be a more reasonable measure of the distance between samples, a new path-cost length is introduced as a non-metric measure of the affinity between two samples, and on this basis the NMSNN method is proposed. It considers both the direct relationship between two samples and global information contributed by the other samples, so that the structural information of labeled and unlabeled samples is exploited effectively. The proposed method is simple and practical because only one parameter needs to be adjusted. The presented experimental results suggest that NMSNN uses unlabeled data effectively in most cases. A sketch of the path-based affinity idea follows.
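As an illustration of the kind of path-based affinity NMSNN builds on, the sketch below computes, for every pair of samples, the smallest achievable maximum edge length over all paths connecting them (the minimax path cost): the direct distance matters, but so do the intermediate samples along the best path. This particular cost function is an assumption chosen for concreteness, not the dissertation's own measure.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_path_affinity(X):
    """All-pairs minimax path cost over the complete Euclidean graph.

    On a complete graph, the minimax path cost between two nodes equals
    the largest edge on their (unique) minimum-spanning-tree path, so an
    MST plus one traversal per source suffices.
    """
    n = len(X)
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    mst = np.maximum(mst, mst.T)  # symmetrize the tree's adjacency matrix
    cost = np.zeros((n, n))
    for s in range(n):
        # Walk the tree from s, tracking the largest edge seen so far.
        seen, stack = {s}, [(s, 0.0)]
        while stack:
            u, worst = stack.pop()
            for v in np.nonzero(mst[u])[0]:
                if v not in seen:
                    seen.add(v)
                    cost[s, v] = max(worst, mst[u, v])
                    stack.append((v, cost[s, v]))
    return cost
```

Under such a cost, two samples in the same dense cluster score a low value even when far apart, while samples separated by a sparse region score a high one, which is what lets a nearest-neighbor-style classifier exploit the cluster structure of the unlabeled data.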
Keywords/Search Tags: Feature Selection, Semi-Supervised Classification, Co-training, Ensemble Learning, Non-metric Measure, High-dimensional and Low-sample, Decision Trees, Neural Network