Research On Feature Selection And Classification Of Epistatic Gene

Posted on: 2012-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Yang
Full Text: PDF
GTID: 2210330362960119
Subject: Computer Science and Technology
Abstract/Summary:
With the completion of the Human Genome Project, the research focus of the life sciences has shifted from DNA sequencing to gene function. Identifying susceptibility genes for complex diseases, together with gene-disease associations, will improve understanding of the pathogenesis of these diseases and thus their prevention, diagnosis, and treatment. Although emerging technologies (e.g., gene chips and high-throughput sequencing) have generated a wealth of biological data, progress in the study of complex diseases has been limited by the high dimensionality of the data and by the existence of epistasis. Dimensionality reduction for detecting epistasis in biological data and the modeling of gene-disease relationships have therefore become a focus of genome-wide association studies for complex diseases.

We propose a method for reducing the dimensionality of epistasis data and a method for modeling gene-disease relationships. We also develop a program that integrates these two methods. The main contributions of this thesis are as follows:

1. A feature selection method based on dynamic instance selection is proposed. ReliefF estimates the quality of attributes by checking whether a randomly selected instance has the same or different attribute values as its nearest neighbor from the same class and its nearest neighbor from the other class. Because they can detect interactions between attributes, ReliefF and its successors are widely used for epistasis analysis. However, ReliefF estimates the attributes statically over the whole sampling space, without considering that a candidate attribute may be redundant for instances that are already labeled. To address this problem, we introduce an improved method based on dynamic instance selection, which dynamically re-estimates candidate attributes using the unlabeled instances. The proposed method extends ReliefF's ability to filter epistatic genes.

2. A classifier for modeling gene-disease relationships is proposed. The multifactor dimensionality reduction (MDR) method classifies each multilocus genotype combination as either high risk or low risk according to its ratio of cases to controls. Because it exhaustively examines combinations of single nucleotide polymorphisms (SNPs), MDR can only be applied to small datasets, so it needs to be accelerated. We propose a method named Tabu search based Multifactor Dimensionality Reduction (TabuMDR), which replaces the exhaustive search over SNP combinations with tabu search. TabuMDR is tailored from the tabu search framework by adapting its solution encoding, initial solution, neighborhood solutions, and diversification strategy. The proposed method scales to large-scale analyses.

3. A program that integrates the filter and the classifier described above for detecting epistasis is developed. The two are combined by transforming the weight that the feature selection method estimates for each attribute into the probability that the attribute is selected when the classifier generates solutions. The hybrid method is more feasible in practice because it can improve the accuracy of the classifier while also decreasing the running time.
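The attribute estimate described in contribution 1 can be illustrated with a minimal sketch of the basic Relief idea: one nearest neighbor from the same class (a "hit") and one from the other class (a "miss"). This is not the thesis's dynamic instance selection method, whose details the abstract does not give. The function names, the Hamming distance, and the choice to visit every instance once when `n_samples` is omitted are assumptions made for illustration.

```python
import random


def hamming(a, b):
    """Number of attributes on which two instances differ."""
    return sum(u != v for u, v in zip(a, b))


def relief_weights(X, y, n_samples=None, seed=0):
    """Relief-style weights for discrete attributes (e.g. SNP genotypes).

    X: list of instances, each a list of discrete attribute values.
    y: list of class labels (e.g. 1 = case, 0 = control).
    Assumption: if n_samples is None, every instance is visited once,
    which makes the result deterministic; otherwise instances are
    drawn at random with replacement.
    """
    n, m = len(X), len(X[0])
    if n_samples is None:
        picks = list(range(n))
    else:
        rng = random.Random(seed)
        picks = [rng.randrange(n) for _ in range(n_samples)]

    weights = [0.0] * m
    for i in picks:
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no neighbor available in one of the classes
        hit = min(hits, key=lambda j: hamming(X[i], X[j]))
        miss = min(misses, key=lambda j: hamming(X[i], X[j]))
        for a in range(m):
            # Differing from the nearest hit lowers the weight;
            # differing from the nearest miss raises it.
            weights[a] -= (X[i][a] != X[hit][a]) / len(picks)
            weights[a] += (X[i][a] != X[miss][a]) / len(picks)
    return weights
```

On a toy dataset whose class is the XOR of two attributes, with a third irrelevant attribute, neither interacting attribute is informative on its own, yet this estimate should rank both above the irrelevant one. That is the property that makes Relief-style filters attractive for epistasis analysis.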
Keywords/Search Tags: Epistasis, Feature Selection, Dynamic Instance Selection, ReliefF, Classification, Multifactor Dimensionality Reduction, Tabu Search
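The high-risk/low-risk labeling that the abstract attributes to MDR can be sketched for a single SNP combination as follows. This covers only the labeling step, not the tabu search of TabuMDR. The function name, the coding of status (1 = case, 0 = control), and the default threshold (the overall case-to-control ratio of the dataset) are assumptions made for illustration.

```python
from collections import defaultdict


def mdr_risk_labels(genotypes, status, snps, threshold=None):
    """Label each observed multilocus genotype as 'high' or 'low' risk.

    genotypes: list of rows, one genotype value per SNP.
    status:    list of 1 (case) or 0 (control), one per row.
    snps:      indices of the SNPs forming the combination to evaluate.
    threshold: case/control ratio above which a cell is high risk.
               Assumption: defaults to the overall ratio in the data.
    """
    cases = defaultdict(int)
    controls = defaultdict(int)
    for row, s in zip(genotypes, status):
        cell = tuple(row[j] for j in snps)  # multilocus genotype
        if s == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1

    if threshold is None:
        threshold = sum(cases.values()) / max(sum(controls.values()), 1)

    labels = {}
    for cell in set(cases) | set(controls):
        # Equivalent to cases/controls > threshold, but safe when a
        # cell contains no controls.
        is_high = cases[cell] > threshold * controls[cell]
        labels[cell] = "high" if is_high else "low"
    return labels
```

An exhaustive MDR run would repeat this labeling for every SNP combination of a given size, which is the cost that the abstract says tabu search is meant to avoid.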