This thesis first analyzes the problems of existing research on feature selection based on Support Vector Machines (SVMs). It then presents feature selection algorithms based on SVMs and on four new criteria for evaluating the discrimination of features between classes: three of the criteria measure the discernibility of an individual feature between classes, and the fourth measures that of a feature subset in classification problems. Finally, according to the properties of gene datasets, special feature selection algorithms for gene selection are introduced. The author's major contributions are outlined below.

1. The G-score and SVM based feature selection algorithms are proposed to overcome the disadvantage of the feature selection algorithms based on F-score and SVM, which can only deal with binary classification problems. The G-score is a generalization of the F-score, so the criterion can measure the discrimination of features among more than two sets of real numbers. At the same time, the sequential forward search, sequential forward floating search, and sequential backward floating search strategies are generalized; in this thesis the generalized strategies are referred to as GSFS, GSFFS, and GSBFS, respectively. The proposed feature selection algorithms are tested on datasets from the UCI machine learning repository. The experimental results confirm the validity of the G-score and SVM based feature selection algorithms, and also show that the algorithm using the GSFFS search strategy is the best in terms of the size of the selected feature subset, while the one using the GSFS search strategy is optimal when the generalization ability of the classifier is considered.

2. The D-score and SVM based hybrid feature selection algorithms are proposed to overcome a deficiency of the feature selection algorithms based on G-score
and SVM, namely that the influence of different measurement units on different features is not considered when measuring their discriminability. The D-score criterion not only shares the G-score's ability to measure the discrimination among more than two sets of real numbers, but is also unaffected by the differing measurement units of features when calculating their discriminability. D-score is used as the criterion to measure the importance of a feature; the GSFS, GSFFS, and GSBFS strategies are adopted, respectively, as search strategies; and SVM is used as the classification tool, yielding three new hybrid feature selection methods. These methods combine the advantages of filters and wrappers: SVM evaluates the classification capacity of the selected feature subset via classification accuracy and guides the feature selection procedure. The new hybrid feature selection algorithms are tested on nine datasets from the UCI machine learning repository and compared with the corresponding algorithms based on G-score and SVM. Experimental results show that the D-score and SVM based hybrid feature selection algorithms outperform those based on G-score and SVM, and can reduce dimensionality without compromising the classification capacity of the datasets. Among the three hybrid algorithms based on D-score and SVM, the one using the GSFFS search strategy is the best in terms of the size of the selected feature subset, and the one based on the GSFS search strategy is best when considering the generalization ability of a classifier.

3. The DFS (Discernibility of Feature Subsets) and SVM based hybrid feature selection algorithms are put forward to avoid a deficiency shared by the G-score and SVM based and the D-score and SVM based hybrid feature selection algorithms, namely that the correlation between features is not considered when evaluating the importance of
features between classes in classification problems. The DFS criterion considers the joint contribution of all features in a feature subset to classification: it computes the combined G-score of the features in the subset and uses this combined G-score as the discernibility of the subset. The search strategies are the four popular classic strategies: sequential forward search (SFS), sequential backward search (SBS), sequential forward floating search (SFFS), and sequential backward floating search (SBFS). However, our use of SFFS and SBFS differs somewhat from the classic versions: a feature is added in SFFS, or deleted in SBFS, according to the DFS value of the feature subset, and the decision to delete or restore that feature during the floating step is made using the classifier's accuracy on the training subset. In addition, we put forward an improved CFS (Correlation based Feature Selector) criterion, named CFSPabs (Correlation based Feature Selector based on the absolute value of Pearson's correlation coefficient). Unlike CFS, CFSPabs does not distinguish positive from negative correlation between features; it only considers whether features are correlated. The DFS and SVM based hybrid feature selection algorithms are tested on 10 datasets from the UCI machine learning repository. Experimental results show that the DFS and SVM based hybrid feature selection algorithms are better than those based on CFS and SVM or on CFSPabs and SVM. However, the CFSPabs and SVM based feature selection algorithms are the best when considering the size of a feature subset.

4. Considering the generalization ability of SVMs on nonlinear classification problems, new feature selection algorithms based on SVM classifiers are introduced to overcome a potential deficiency of the feature selection algorithms based on G-score and SVM, on D-score and SVM, or on DFS and SVM, namely that active features may be deleted when dealing with nonlinear
classification problems. At the same time, the disadvantages of SVM-RFE, the very popular SVM based feature selection algorithm proposed by Guyon, are addressed as well. The new algorithms are SVM RFA (SVM Recursive Feature Addition) and SVM RFE (SVM Recursive Feature Elimination). They calculate the importance of a feature from its weights in SVM models. SVM RFE generalizes the SVM-RFE proposed by Guyon for binary classification to arbitrary classification problems, while SVM RFA uses the forward search strategy, in contrast to SVM RFE, which relies on the backward strategy. These two feature selection algorithms are tested on nine classic datasets from the UCI machine learning repository. The experimental results demonstrate that these two SVM model based feature selection algorithms can reduce the dimensionality of datasets without compromising their classification capacity, and that SVM RFA outperforms SVM RFE on eight of the nine datasets. Furthermore, the experimental results show that SVM RFA is much more efficient than SVM RFE on high dimensional datasets.

5. Gene datasets typically contain only a few dozen samples, while the dimension of each sample ranges from several thousand to tens of thousands of features. According to this characteristic, and drawing on the conclusions of the studies above, we propose a gene selection algorithm, SVM RRFA (SVM Recursive Random Feature Addition). In SVM RRFA, the number of genes added in each iteration of the gene selection procedure is determined randomly according to the dimension of the specific dataset. To reduce the running time of SVM RRFA, we also develop a simplified SVM RRFA for gene selection. We tested SVM RRFA and the simplified SVM RRFA on three gene datasets from the gene expression project of Princeton University. The experimental results show that SVM RRFA can find genes that discriminate between cancer patients and healthy people, and can classify the
cancer patients from healthy people efficiently. The simplified SVM RRFA outperformed SVM RRFA in accuracy, specificity, and Matthews correlation coefficient, although sensitivity was not improved.
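To make the hybrid filter-wrapper scheme described above concrete, the following sketch implements a GSFS-style forward search in Python with NumPy. The `g_score` here is one plausible multi-class variance-ratio generalization of the F-score (the thesis defines its own G-score and D-score, which may differ in detail), and a nearest-centroid classifier stands in for the SVM wrapper; all function names and the toy data are illustrative, not taken from the thesis.

```python
import numpy as np

def g_score(X, y):
    """Per-feature between-class discrimination score: ratio of
    between-class scatter to within-class variance (a plausible
    multi-class generalization of the binary F-score)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0) / max(len(Xc) - 1, 1)
    return between / (within + 1e-12)

def centroid_accuracy(X, y):
    """Nearest-centroid stand-in for the SVM wrapper used in the thesis."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float((pred == y).mean())

def gsfs_select(X, y, fit_score):
    """GSFS-style hybrid selection sketch: rank features by the filter
    criterion, try adding them one at a time in that order, and keep a
    feature only if the wrapper accuracy improves."""
    order = np.argsort(g_score(X, y))[::-1]   # best-scoring features first
    selected, best_acc = [], 0.0
    for j in order:
        trial = selected + [int(j)]
        acc = fit_score(X[:, trial], y)
        if acc > best_acc:
            selected, best_acc = trial, acc
    return selected, best_acc

# Toy demonstration data (hypothetical; 3 classes, 3 features):
rng = np.random.default_rng(0)
y = np.repeat(np.array([0, 1, 2]), 20)
X = np.column_stack([
    5.0 * y + rng.normal(0, 0.5, 60),   # strongly discriminative feature
    rng.normal(0, 1.0, 60),             # pure noise feature
    2.0 * y + rng.normal(0, 1.0, 60),   # moderately discriminative feature
])
selected, acc = gsfs_select(X, y, centroid_accuracy)
```

In this toy run the strongly discriminative feature is ranked first by `g_score` and suffices for the wrapper, so the noise feature is never kept; replacing `centroid_accuracy` with a cross-validated SVM would recover the hybrid structure described in the thesis.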