Research And Application On Single Nucleotide Polymorphism Analysis Algorithms

Posted on: 2011-12-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Wang
GTID: 1100360332456452
Subject: Artificial Intelligence and information processing
Abstract/Summary:
Research on single nucleotide polymorphisms (SNPs) is one of the most important topics in bioinformatics. The completion of the rough draft of the human genome in 2000 and of the whole Human Genome Project in 2003 greatly advanced research on the genetic information in individual genome sequences and on identifying the genome sequence fragments related to human phenotypes. More and more bioinformatics researchers devote themselves to mining genetic markers, analyzing the genetic association or variation of these markers, and applying the results to disease association research. As one of the important genetic markers, SNPs are receiving more and more attention in both research and application. However, because the number of SNPs is huge, current computational methods are inefficient, and they are usually costly and time-consuming. In this thesis, we therefore analyze the natural characteristics of SNPs, integrate machine learning methods with graph theory, and study genome sequence polymorphism analysis in depth.

The contributions of the dissertation are as follows:

(1) An approach based on filters and ensemble classifiers is proposed for mining SNPs in ESTs.

It is costly to discover and validate SNPs manually. Current approaches usually suffer from problems such as a high false positive rate and inapplicability to inhomogeneous species. A novel approach is proposed for SNP mining. First, filters are constructed using the natural characteristics of SNPs, and the SNP candidates are selected from the expressed sequence tag (EST) sequences. Then a group of valid features is defined and the training sets are rebuilt in the classification algorithm. To solve the imbalanced learning problem in SNP mining, ensemble learning theory and a strategy similar to AdaBoost are used. Multiple classifiers are built and a reasonable voting mechanism is employed to mine the SNPs from the candidate set.
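The idea of training several classifiers on rebalanced training sets and combining them by voting can be sketched as follows. This is a minimal illustration only: it uses balanced undersampling with decision stumps and a majority vote, not AdaBoost and not the classifiers of the thesis. The two features (minor allele frequency and mean base quality), the toy data, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold classifier by exhaustive search."""
    best = None
    for f in range(len(samples[0][0])):
        for x, _ in samples:
            thr = x[f]
            for sign in (1, -1):
                errors = sum(
                    1 for xi, yi in samples
                    if (1 if sign * (xi[f] - thr) >= 0 else 0) != yi
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, sign)
    _, f, thr, sign = best
    return lambda x: 1 if sign * (x[f] - thr) >= 0 else 0

def train_balanced_ensemble(samples, n_members=5, seed=0):
    """Train each member on all minority samples plus an equal-sized
    random draw from the majority class (balanced undersampling)."""
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    members = []
    for _ in range(n_members):
        subset = minority + rng.sample(majority, len(minority))
        members.append(train_stump(subset))
    return members

def vote(members, x):
    """Majority vote over the ensemble members."""
    return Counter(m(x) for m in members).most_common(1)[0][0]

# Hypothetical candidates: ((minor allele frequency, mean base quality), label),
# where label 1 marks a true SNP and label 0 a pseudo-SNP.
toy_candidates = [
    ((0.40, 35), 1), ((0.35, 30), 1), ((0.45, 38), 1),
    ((0.05, 20), 0), ((0.02, 25), 0), ((0.03, 31), 0),
    ((0.08, 28), 0), ((0.01, 22), 0), ((0.04, 36), 0),
    ((0.06, 27), 0), ((0.07, 33), 0), ((0.02, 29), 0),
]

members = train_balanced_ensemble(toy_candidates, n_members=5, seed=1)
prediction_high = vote(members, (0.50, 33))
prediction_low = vote(members, (0.01, 22))
```

An odd number of members avoids ties in the binary vote. A boosting variant would instead reweight misclassified candidates between rounds.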
In comparison with current methods, both the specificity and the sensitivity of our approach exceed 80%. The obtained SNPs are more accurate and the ratio of pseudo-SNPs in the obtained SNP set is greatly reduced. Thus, the false positives in SNP mining are greatly reduced. The experimental results also show that our approach can be applied to mining SNPs in species that have no genome data, which helps to save the cost of biological experiments.

(2) A method based on graph and clustering algorithms is proposed for tagSNP selection.

Using the huge number of SNPs mined from ESTs in related research is costly. Several computational methods have been proposed to select informative SNPs, which are also called tagSNPs. To solve problems in current methods such as information loss and restrictive limitations, we first define the SNP graph to describe the linkage disequilibrium and the genetic variation between SNPs, and then propose a graph algorithm based on the maximum density subgraph and entropy to select tagSNPs. TagSNP selection approaches for haplotypes and genotypes are investigated based on this graph algorithm. A KNN strategy is employed to preprocess the data and reduce the complexity of the graph algorithm on large SNP data. The experimental results show that our method can reduce the information loss and increase the prediction accuracy.

(3) A population structure inference algorithm based on information theory and hierarchical clustering is developed using our SNP mining and tagSNP selection results.

Population structure inference is an important problem in SNP analysis. In our thesis, tagSNPs are first used as principal features in population structure inference. The graph-based feature selection algorithm is used to reduce the dimensionality of the genotypes and the influence of noise or invalid SNP loci. The transformation function integrates the sequence distance and the information entropy.
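The abstract does not give the transformation function itself. Purely as an illustration of how a sequence distance and an entropy term could be combined, the sketch below weights each mismatching locus of a Hamming-style distance by the Shannon entropy of that locus. The function names, the weighting scheme, and the toy haplotypes are all assumptions and are not taken from the thesis.

```python
import math
from collections import Counter

def locus_entropy(alleles):
    """Shannon entropy (in bits) of the allele distribution at one locus."""
    counts = Counter(alleles)
    total = len(alleles)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def entropy_weighted_distance(seq_a, seq_b, entropies):
    """Hypothetical distance: each mismatching locus contributes its entropy,
    so uninformative (low-entropy) loci barely affect the result."""
    return sum(h for a, b, h in zip(seq_a, seq_b, entropies) if a != b)

# Toy haplotypes over three SNP loci (assumed data).
haplotypes = ["AAT", "AGT", "AGC", "AAT"]

# One entropy value per locus, computed column by column.
entropies = [locus_entropy(column) for column in zip(*haplotypes)]

distance = entropy_weighted_distance(haplotypes[0], haplotypes[2], entropies)
```

A distance matrix built this way could then be passed to any standard hierarchical clustering routine.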
A novel algorithm based on hierarchical clustering is developed. Our method performs well on both simulated and real human data. Moreover, the tagSNPs obtained by our feature selection algorithm can also be applied in current inference methods, where they perform well in reducing running time and increasing inference accuracy.

(4) An algorithm for disease population discrimination using human mitochondrial single nucleotide polymorphisms is developed based on the previous results of genome sequence polymorphism analysis in our thesis.

The final aim of SNP analysis is to help disease association research. Disease population discrimination, one of the most important problems in disease association research, is receiving more and more attention. In contrast to current methods, our method uses mitochondrial DNA (mtDNA) as research data. The mtDNA sequences are aligned by a keyword-tree-based algorithm. The mtSNPs are mined from the mtDNA alignments using the genetic characteristics of SNPs and mtDNA. All unassociated mtSNPs are eliminated using our population structure inference algorithm. A locating algorithm based on statistical significance is proposed to find the disease-associated mtSNPs. The most statistically significant mtSNPs are selected as classification features and used in valid classifiers. The efficiency of our method is demonstrated by experiments on real disease data. Moreover, the significant mtSNPs and their selection algorithm can also be used in other disease association studies.
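Ranking loci by statistical significance can be sketched as follows. The abstract does not say which test the thesis uses, so this example assumes a Pearson chi-square test on a 2x2 table of allele counts by case or control status. The locus names and counts are invented for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def rank_loci(tables):
    """Return locus names sorted from most to least significant."""
    p_values = {name: chi_square_2x2(*t)[1] for name, t in tables.items()}
    return sorted(p_values, key=p_values.get)

# Hypothetical counts per locus:
# (cases with variant, cases without, controls with variant, controls without)
toy_tables = {
    "locus_A": (30, 10, 10, 30),
    "locus_B": (20, 20, 20, 20),
    "locus_C": (25, 15, 15, 25),
}

ranking = rank_loci(toy_tables)
chi2_a, p_a = chi_square_2x2(*toy_tables["locus_A"])
```

The top-ranked loci would then serve as classification features. With many loci, a multiple-testing correction would normally be applied before selecting them.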
Keywords/Search Tags:single nucleotide polymorphisms, class imbalance, site graph, information theory, population structure inference, disease population discrimination, bioinformatics