Research And Parallelization Design Of Epistasis Detection Algorithms In Genome-wide Association Studies

Posted on: 2016-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Zhou
GTID: 2180330467998858
Subject: Computer application technology
Abstract/Summary:
In recent years, with the completion of the Human Genome Project and recent advances in high-throughput genotyping technologies, genome-wide association studies (GWAS) have received more and more attention. GWAS map genotypes to phenotypes and study the relationship between them on a genome-wide scale. The data for this research cover the entire genome rather than a single gene, and the research is of great significance for understanding the underlying mechanisms and etiology of complex human diseases and provides valuable help for drug research. In addition, GWAS methodology for complex diseases (such as study design, statistical analysis, and interpretation of results) has made great progress, so this period is called the "first wave" of GWAS.

Single nucleotide polymorphisms (SNPs) are a type of molecular marker. Researchers found that alleles of the same locus often differ by only one or a few nucleotides; these differences between alleles are called single nucleotide polymorphisms. SNPs are widespread in the genome and easy to detect, so researchers often treat SNPs as the object of study in genome-wide association studies. The term epistasis denotes a masking effect whereby an allele at one locus prevents the allele at another locus from manifesting its effect; it was first proposed by Bateson in 1909.
The detection of epistatic interactive effects of multiple genetic variants on susceptibility to complex human diseases is a great challenge in genome-wide association studies, and may shed light on the identification and characterization of genes that influence the risk of common, complex multifactorial diseases. Epistasis detection involves compute-intensive problems such as handling huge biological datasets; applying parallel computing technology to develop algorithms and programs that run under parallel frameworks provides a potential way to meet this challenge.

Although many methods have been proposed to identify gene-gene interactions (also called epistasis detection), the lack of an explicit definition of epistatic effects, together with computational difficulties, makes it impossible to apply these methods to GWAS. Detecting gene-gene interactions in GWAS is a compute-intensive task. SNPs are the most abundant source of genetic variation in the human genome, and their number can reach millions in public datasets. Many existing statistical tests, such as the chi-square test, the likelihood ratio test, and entropy-based tests, cannot detect epistasis well.
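To make the pairwise chi-square test mentioned above concrete, here is a minimal sketch of how such a statistic can be computed for one SNP pair against case/control status. It is an illustration only, not the thesis's implementation: the function name `pairwise_chi_square` and the 0/1/2 genotype encoding are assumptions introduced for this example.

```python
from itertools import product


def pairwise_chi_square(snp_a, snp_b, status):
    """Pearson chi-square statistic for a SNP pair against case/control status.

    snp_a, snp_b: genotype codes 0, 1 or 2 per individual (assumed encoding).
    status: 1 for case, 0 for control.
    """
    # Observed counts: 9 genotype combinations x 2 phenotype classes.
    observed = {key: 0 for key in product(range(3), range(3), range(2))}
    for ga, gb, s in zip(snp_a, snp_b, status):
        observed[(ga, gb, s)] += 1

    n = len(status)
    class_totals = [sum(1 for s in status if s == c) for c in range(2)]

    chi2 = 0.0
    for ga, gb in product(range(3), range(3)):
        row_total = observed[(ga, gb, 0)] + observed[(ga, gb, 1)]
        if row_total == 0:
            continue  # an empty genotype combination contributes nothing
        for s in range(2):
            expected = row_total * class_totals[s] / n
            if expected > 0:
                chi2 += (observed[(ga, gb, s)] - expected) ** 2 / expected
    return chi2
```

With n SNPs there are n(n-1)/2 such pairs, so an exhaustive scan over millions of SNPs requires on the order of 10^11 to 10^12 tests, which is why the task is described as compute-intensive.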
Other algorithms, such as common logistic regression, Bayesian inference, neural networks, ant colony algorithms, particle swarm optimization, and genetic algorithms, are also used to detect epistasis. However, the definitions of epistasis in statistics and in biology are not exactly consistent, and the epistasis model in a biological organism is usually far more complicated than that in statistics; in other words, statistically significant associations are not necessarily biologically significant.

Based on some classical epistasis detection algorithms proposed in recent years, the advantages and disadvantages of these algorithms, and the main problems in the epistasis detection domain, we developed three different algorithms based on three different theories, angles, and methods: a method based on Bayesian inference, a method based on ant colony optimization, and a method based on the chi-square test. We give a detailed account of the design ideas, design methods, theory, and experimental results for each algorithm, and compare the three algorithms in detail in terms of prediction accuracy, run time, stability, scalability, and so on. The algorithm based on ant colony optimization is best in prediction accuracy. This may be attributed to the good local and global search ability of ant colony optimization; the roulette wheel selection enhances the randomness of the algorithm, and the expert knowledge and heuristic information incorporated into the pheromone updating rule also help increase its search ability.
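The roulette wheel selection and pheromone updating mentioned above can be sketched generically as follows. This is a textbook-style illustration under stated assumptions, not the rule used in the thesis: the function names, the evaporation rate of 0.1, and the additive reinforcement by a `score` value are all hypothetical choices made for this example.

```python
import random


def roulette_wheel_select(weights, rng):
    """Return an index drawn with probability proportional to its weight."""
    total = sum(weights)
    threshold = rng.random() * total
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def update_pheromone(pheromone, chosen, score, evaporation=0.1):
    """Evaporate every trail, then reinforce the SNPs in the chosen set.

    The additive update by `score` is an assumed, generic rule.
    """
    for i in range(len(pheromone)):
        pheromone[i] *= 1.0 - evaporation
    for i in chosen:
        pheromone[i] += score
    return pheromone
```

In an ant colony search over SNPs, each ant would typically call `roulette_wheel_select` with the current pheromone levels as weights, so SNPs that appeared in well-scoring combinations are sampled more often while every SNP with nonzero pheromone can still be chosen.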
As for run time and speed, the algorithm based on the chi-square test performs best, because we designed and implemented it on Google's MapReduce platform, which provides an excellent solution to the data-intensive analysis problem in GWAS; using upper bounds of the chi-square test to prune before calculating the statistic also greatly improves computational speed. We applied Bayesian inference and epistasis groups to epistasis detection in GWAS, and the performance of this algorithm lies between the two algorithms described above, which means there is much room for improvement and application of Bayesian inference theory in epistasis detection.

Future work: the data-intensive problem is a huge challenge for epistasis detection in GWAS, resulting in a vast search space and a heavy computational burden. We can use parallel computing technology such as CUDA to propose a new epistasis detection algorithm and compare its performance with algorithms using Google's MapReduce parallel technology; in this way, we provide a potential approach, theory, and technology for epistasis detection in GWAS. We should also look for ways to control several issues, such as genotyping errors, the false discovery rate (FDR), and the family-wise error rate (FWER), which may affect the accuracy of our algorithms. The identification and characterization of genes that influence the risk of common, complex multifactorial disease primarily through interactions with other genes and environmental factors remains a statistical and computational challenge in genetic epidemiology, and plays an important role in the treatment of complex human diseases and in drug research.
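The MapReduce formulation mentioned above for the chi-square algorithm can be illustrated with a small in-process sketch of the map and reduce steps that build the genotype-by-status counts a chi-square test needs. This is only a plain-Python illustration of the pattern, not the thesis's job and not an actual MapReduce or Hadoop API: the function names and key layout are assumptions, and the upper-bound pruning step is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict
from itertools import combinations


def map_individual(genotypes, status):
    """Map step: emit one (key, 1) record per SNP pair for one individual.

    The key is (snp_i, snp_j, genotype_i, genotype_j, status).
    """
    for i, j in combinations(range(len(genotypes)), 2):
        yield (i, j, genotypes[i], genotypes[j], status), 1


def reduce_counts(records):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def count_tables(samples):
    """Run map then reduce in-process over (genotypes, status) samples."""
    records = (
        record
        for genotypes, status in samples
        for record in map_individual(genotypes, status)
    )
    return reduce_counts(records)
```

Because each individual is mapped independently and the reduce step only sums counts per key, the work can in principle be split across machines by individual and merged by key, which is the property that makes this kind of counting a natural fit for a MapReduce framework.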
Keywords/Search Tags: Epistasis detection, Bayesian inference, Ant colony optimization, Chi-square test, MapReduce