Research On Epistasis Detection Algorithm And Its Implementation On Mapreduce Framework

Posted on: 2015-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: A Sun
Full Text: PDF
GTID: 2268330428497990
Subject: Computer application technology
Abstract/Summary:
With the rapid development of gene chip technology, genotyping all SNPs in the genome has become possible, and the genome-wide association study (GWAS) has emerged as a hot topic in molecular biology. GWAS uses SNPs as genetic markers, detects significant sequence variants across large numbers of samples, and finally selects the genetic loci that may cause disease.

Family-based linkage analysis has brought significant achievements in research on monogenic diseases, but common complex diseases such as diabetes, hypertension, and coronary artery disease are different, because they may be affected by several genes or by environmental factors, and their mode of inheritance does not follow the classical Mendelian laws. The traditional detection methods used for monogenic diseases cannot meet the demands of complex-disease research, so researchers are looking for strategies suited to analyzing these diseases. Large numbers of GWAS have found many SNPs associated with complex diseases. However, those SNPs explain only a small part of the genetic variation, because the studies focus on the individual effect of each SNP locus and omit the joint effects between SNPs. A large number of experiments show that complex diseases are affected by many SNPs acting together.

The concept of interaction between loci was first proposed by Bateson. With further study, the meaning of the concept has been extended to include compositional epistasis and statistical epistasis. This paper focuses mainly on statistical epistasis, namely the deviation of the combined effect of two loci from the sum of their individual effects. There are many ways to study epistasis, and one class of them is model-free: it does not need to assume any model relating genotype to phenotype, and it can detect interactions across the whole of a GWAS. The MDR (multifactor-dimensionality reduction) method, first proposed by Ritchie, belongs to the model-free methods.
MDR effectively reduces the data to a single dimension by classifying each genotype combination as high-risk or low-risk. However, the MDR method also has many problems. Its binary classification provides no quantitative measure of disease risk for each genotype combination, so the disease risks of different combinations cannot be compared. The MDR method is also prone to false positive and false negative errors. To solve these problems, this paper introduces the odds ratio (OR) and its 95% confidence interval (CI) to classify genotype risk. The improved method, OR_MDR, still adopts an exhaustive search strategy. To skip unnecessary interactions and make the search more efficient, the paper uses the ant colony optimization (ACO) algorithm as a heuristic search strategy on top of OR_MDR. The new algorithm, ACO_OR_MDR, measures the strength of interactions with the chi-square test and updates the pheromone concentration of each SNP during the iterative process, so that over the iterations the ant colony concentrates on the significant SNP loci and the algorithm can finally screen out the significant epistasis. Because the number of SNPs in the whole genome is huge, and in order to reduce computation time, this paper implements ACO_OR_MDR on the MapReduce cloud computing platform to take advantage of parallel computing.

The main contributions of this paper are as follows: ① The original genotype classification criterion used in MDR is replaced with the odds ratio (OR) and its 95% CI. The risk of a genotype can then be described quantitatively, and the 95% CI can be used to determine whether the result is significant. ② The ant colony optimization (ACO) algorithm is used in the search for epistatic effects. The chi-square value is used to describe the strength of association, and the classification accuracy on the test data set is used to update the SNP pheromone concentration.
Thus the search area of SNP pairs is effectively narrowed by the successive ACO iterations.

Future work: ① Relying solely on classification accuracy to update the pheromone concentration during the iterations is not enough, and prior knowledge about the SNP loci could also be used. ② In an actual genome there are always many epistatic effects, so the problem of detecting epistasis can be regarded as a multi-extremal problem. A multi-objective particle swarm optimization algorithm could be used for epistasis detection.
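
The odds-ratio criterion named in contribution ① can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the 2x2 table layout (cases and controls carrying a genotype combination versus all others), the 0.5 continuity correction for empty cells, and the rule "high-risk if the whole 95% CI lies above 1, low-risk if it lies below 1" are all assumptions made here for illustration; the abstract does not state the exact decision rule.

```python
import math


def odds_ratio_ci(cases_in, controls_in, cases_out, controls_out, z=1.96):
    """Return (OR, lower, upper) for one genotype combination.

    cases_in / controls_in: samples carrying the genotype combination.
    cases_out / controls_out: samples carrying any other combination.
    z=1.96 gives an approximate 95% confidence interval.
    """
    cells = [cases_in, controls_in, cases_out, controls_out]
    # Assumption: add 0.5 to every cell when any cell is empty,
    # so the ratio and its standard error stay finite.
    if any(count == 0 for count in cells):
        cells = [count + 0.5 for count in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def classify_genotype(cases_in, controls_in, cases_out, controls_out):
    """Label a genotype combination using the OR and its 95% CI.

    Assumed rule: the label is only assigned when the interval
    excludes 1, i.e. when the association is significant.
    """
    _, lower, upper = odds_ratio_ci(
        cases_in, controls_in, cases_out, controls_out
    )
    if lower > 1:
        return "high-risk"
    if upper < 1:
        return "low-risk"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical counts: 30 cases and 10 controls carry the genotype,
    # 70 cases and 90 controls do not.
    print(odds_ratio_ci(30, 10, 70, 90))
    print(classify_genotype(30, 10, 70, 90))
```

Unlike the binary MDR label, the OR gives each genotype combination a comparable magnitude of risk, and a confidence interval that spans 1 leaves the combination unlabeled rather than forcing it into a class.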
Keywords/Search Tags: Epistasis effect, odds ratio, Multifactor-dimensionality reduction, ant optimization algorithm, MapReduce