
Algorithms For Haplotype Analysis

Posted on: 2020-07-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Zhou
GTID: 1368330572474382
Subject: Probability theory and mathematical statistics
Abstract/Summary:
A haplotype is a group of tightly linked genetic variants that tend to be inherited together on a single chromosome, and can be thought of as a "super allele" composed of multiple loci. Haplotype information plays an important role in many applications, such as genome-wide association studies (GWAS), linkage analysis, epigenetics, and evolutionary and population studies. Most organisms, including humans, are diploid. Conventional next-generation sequencing (NGS) can only obtain the combined information of the two haplotypes; the sequence on each individual chromosome (also known as phase information) cannot be observed directly. In addition, the pooled DNA design, in which DNA from different individuals is pooled and sequenced together, has been widely used in the first stage of GWAS because of its low cost and other advantages. Reconstructing an individual's phase information from incomplete or pooled genotype data, inferring the true haplotypes in the population, and estimating their frequencies are therefore essential problems in genomics research and have received wide attention.

This dissertation reviews the development of the main algorithmic frameworks for haplotype analysis and proposes two new methods: CSHAP, an efficient haplotype frequency estimation algorithm based on compressed sensing theory, and a generalized EM algorithm (GEM) based on an approximate coalescent prior. Extensive simulation studies show that CSHAP estimates haplotype frequencies accurately and with very high computational efficiency. The algorithm applies to both individual and pooling designs, and it gives robust estimates whether or not Hardy-Weinberg equilibrium holds. In the simulations, the accuracy of CSHAP is similar to that of the widely recognized PHASE algorithm, and it is even better than PHASE for small samples. For large samples, CSHAP is 2 to 3 orders of magnitude faster than PHASE and can be applied efficiently to large-scale genotype datasets. The running time of CSHAP grows only logarithmically with sample size. For pooled DNA designs, CSHAP's computational complexity is independent of pool size, so it can handle pooled genotype datasets with arbitrarily large pools, while supporting a maximum number of sites (loci) far beyond that of the best algorithms available in the literature.

In sequencing experiments, limitations of laboratory equipment often leave some sites missing, and the missing data can have a large impact on downstream research. Filling in incomplete genetic data, known as genotype imputation, is therefore a crucial issue in genome research. There are many methods for genotype imputation, including purely statistical methods and methods based on linkage disequilibrium or on reference haplotypes. We compared the accuracy of different genotype imputation algorithms and extended the EM and CSHAP algorithms to handle missing data. Simulation experiments show that imputation using haplotype information is more accurate than imputation based on linkage disequilibrium. Because compressed sensing is robust to missing data, the CSHAP algorithm provides fairly high imputation accuracy. Moreover, CSHAP's frequency estimation accuracy is less affected by the missing rate than that of other algorithms, giving robust estimates even at high missing rates.

Inference algorithms based on EM have long been considered to have high frequency estimation accuracy but poor phasing accuracy. Frequency estimation and phasing are different problems, and in particular their assessment criteria are different. Frequency estimation requires that an estimated haplotype be exactly the same as the real haplotype, whereas phasing is judged by the similarity between the resolved solution and the true diplotype. We analyzed why the EM algorithm does not work well on the phasing problem and, by applying the principle of parsimony, proposed a generalized EM phasing algorithm (GEM). Simulation studies show that GEM's phasing accuracy is much higher than that of the standard EM algorithm and close to that of fastPHASE, while GEM's computational efficiency is several orders of magnitude higher than that of mainstream software such as PHASE and Shape-IT. Finally, with further HMM-based improvements, GEM can support phasing of arbitrarily long sequences.
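
For readers unfamiliar with the standard EM baseline that the abstract compares against, the following is a minimal Python sketch of textbook EM haplotype frequency estimation from unphased genotypes. It is not CSHAP or GEM, whose details the abstract does not give. It assumes Hardy-Weinberg equilibrium, biallelic loci coded as 0/1/2 copies of the alternate allele, and no missing data; the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs whose allele counts sum to the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed (0 -> 0, 2 -> 1); heterozygous sites are filled in below.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start over candidate haplotypes
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each individual over its compatible pairs in proportion
        # to the pair probability under the current frequencies.
        for pairs in pair_lists:
            weights = [freq[a] * freq[b] * (1.0 if a == b else 2.0) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


if __name__ == "__main__":
    sample = [(0, 0), (0, 0), (2, 2), (1, 1)]
    for hap, p in em_haplotype_frequencies(sample).items():
        print(hap, round(p, 4))
```

In the small example, the double heterozygote `(1, 1)` is ambiguous between the pairs 00/11 and 01/10; because the other individuals carry only 00 and 11, EM assigns almost all of its weight to 00/11. The number of compatible pairs doubles with each additional heterozygous site, which is one reason this baseline does not scale to many loci.
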
Keywords/Search Tags: Genotype, Single-Nucleotide Polymorphism, Haplotype, DNA Pooling, Hardy-Weinberg Equilibrium, Linkage Disequilibrium, Phasing, Genotype Imputation, Coalescent Theory, Bayesian Inference, Hidden Markov Model, Compressed Sensing