
Feature Selection Algorithms For High-throughput Data

Posted on: 2014-01-08  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y J Geng  Full Text: PDF
GTID: 1228330398998896  Subject: Computer application technology
Abstract/Summary:
With the development of high-throughput detection technology, various kinds of high-throughput data, such as gene expression data and single nucleotide polymorphism data, have been obtained in the life sciences. These data provide detailed information for understanding the mechanisms of disease at multiple levels and the diversity among different populations of the same species. However, they are high-dimensional data with small sample sizes, in which the number of features is far larger than the number of samples. Applying traditional pattern classification methods to them directly leads to the "curse of dimensionality". At present, one effective way to avoid the curse of dimensionality is to use feature selection to remove irrelevant features from the data before pattern classification. This dissertation studies, on gene expression data and single nucleotide polymorphism data, how to select features while taking feature interaction into account and how to define a measure that reveals interaction among multiple features. The main contributions are as follows:

1. The gene selection method that combines principal component analysis with shape analysis does not make effective use of the class information of the samples. To address this shortcoming, a new gene selection method combining margin maximizing discriminant analysis with shape analysis is presented. During gene selection, the new method considers not only the interaction between genes but also the relevance between genes and the class label, which improves the classification performance of the selected genes. Experimental results on four microarray gene expression data sets show that the presented method outperforms the method that combines principal component analysis with shape analysis. It also compares favorably with two state-of-the-art multivariable filter methods.

2. A feature selection method based on a maximum conditional relevance minimum redundancy criterion, named CMRMR, is presented. The method can be seen as an extension of the maximum relevance minimum redundancy (MRMR) criterion. Its main characteristic is that, during feature selection, it considers not only the relevance between the already selected features and the candidate feature, but also the influence of the already selected features on the relevance between the candidate feature and the class label. CMRMR is also compared analytically with other methods based on conditional mutual information. The analysis shows that all existing methods based on conditional mutual information expect the candidate feature to bring as much new information about the class label as possible; they differ in the strategy used to achieve this goal. Experimental results on simulated data and gene expression data show that the classification performance of the feature subset selected by CMRMR is higher than that of MRMR and higher than or similar to that of the other methods based on conditional mutual information.

3. The relevance measures employed by current feature selection methods can effectively evaluate the relevance between a feature and the class label or between two features, but they do not consider the influence of the other features on that relevance. In this work, with feature interaction considered as a whole, the sparse representation technique is applied to the feature selection problem, and the sparse representation coefficient is proposed as a relevance measure for feature selection. It differs from existing relevance measures in that it reveals the relevance between a feature and the target under the influence of all the other features in the data, and thus reflects feature interaction.

4. To verify the effectiveness of the sparse representation coefficient as a relevance measure for feature selection, we first evaluate the classification performance of the top q genes ranked by sparse representation coefficient on gene expression data. We then use the sparse representation coefficient to replace the relevance measure employed by the maximum relevance minimum redundancy criterion and by the classical feature selection method FCBF, which yields a new feature selection criterion and a new feature selection method. The performance of the new criterion and method is evaluated on gene expression data. We also compare the new criterion and method with other existing feature selection criteria and methods. The results show that the new criterion and method based on the sparse representation coefficient are effective and perform better than the existing criteria and methods.

5. Because the sparse representation coefficient considers the influence between features when evaluating the relevance of a feature, it can reveal feature interaction to a certain extent. To understand this advantage more clearly, we investigate how well single nucleotide polymorphism subsets selected by the sparse representation coefficient distinguish similar populations. We first construct four classification problems based on HapMap Phase III haplotype data: an American, an Asian, an African and a European classification problem. Among these, the diversity of the populations in the American classification problem is larger than that of the populations in the other three. We then compare the differences between, and the classification performance of, the single nucleotide polymorphism subsets selected by feature selection methods based on the sparse representation coefficient, Symmetrical Uncertainty, the modified T-test and Fst. Experimental results show that the method based on the sparse representation coefficient performs clearly better than the other methods, especially in distinguishing similar populations. The single nucleotide polymorphism subsets selected by the sparse representation coefficient differ substantially from those selected by the other measures, and the distribution characteristics of the single nucleotide polymorphisms selected by these measures also differ.
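
The maximum relevance minimum redundancy (MRMR) criterion that CMRMR extends can be illustrated with a small example. The following is only a minimal sketch of the baseline MRMR criterion in its common "relevance minus mean redundancy" form for discrete features; it is not the dissertation's CMRMR, whose exact formula the abstract does not give. The function names, the toy features `g1`, `g2`, `g3` and the labels are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mrmr_select(features, labels, k):
    """Greedy MRMR selection (illustrative sketch).

    features: dict mapping a feature name to a list of discrete values.
    At each step, pick the candidate that maximizes its relevance to the
    labels minus its mean mutual information with the selected features.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    remaining = list(features)  # keeps insertion order, so ties are deterministic

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: g2 duplicates g1, g3 is weaker but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 0, 1, 1, 0, 1],
}
chosen = mrmr_select(features, labels, 2)
print(chosen)
```

On this toy data the redundancy term penalizes `g2`, an exact copy of `g1`, so the less relevant but less redundant `g3` is chosen second. A conditional variant such as the one described in item 2 would additionally account for how the already selected features change the relevance of a candidate to the class label.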
Keywords/Search Tags: feature selection, sparse representation, mutual information, gene expression data, single nucleotide polymorphism