Haplotype Estimation And Association Analysis

Posted on: 2012-04-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhang
GTID: 1100330335962529
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Since the haplotype is regarded as an information-rich carrier of linkage disequilibrium (LD) across different single-nucleotide polymorphism (SNP) loci, estimating haplotype frequencies from unphased genotyping data has been investigated by many researchers. As a cost-effective alternative to individual genotyping, the pooling DNA design has become a common strategy for initial screening in genome-wide association analysis. In some studies, large pools with sizes up to several hundred were used in order to significantly reduce genotyping cost. However, no method for estimating haplotype frequencies from large DNA pools has been available, owing to the computational burden. This thesis concentrates on developing efficient and effective algorithms to estimate haplotype frequencies from DNA pools of arbitrary size. We also propose a method for haplotype-based association analysis in matched case-control studies.

Under the assumptions of asymptotic normality of the estimated allele frequencies and Hardy-Weinberg equilibrium (HWE), we introduce a constant quantity called the importance factor to measure the contribution of a haplotype to the conditional expectation of the log-likelihood function. The conditional expectation function in the expectation-maximization (EM) algorithm is then reformulated as a maximum entropy model subject to a system of linear constraints on the first- and second-order moments of the pooled genotyping data. This algorithm, named PoooL, can be solved efficiently by the improved iterative scaling method, which achieves the globally optimal solution in the feasible solution space. Simulation studies show that PoooL can efficiently estimate haplotype frequencies from large pools with sizes up to hundreds or thousands and from pools with sizes as small as one or two individuals. The computational complexity of PoooL is independent of pool size, and the computational efficiency for large pooling DNA designs is thus substantially improved over all existing estimation methods.
Simulation results also show that the proposed algorithm is robust to genotyping errors and population stratification.

Although the PoooL algorithm works well for pooling DNA designs with very large pool sizes, the resulting estimates are not maximum likelihood estimates (MLEs), because of the dependence between allele frequencies and LD coefficients, and hence are not statistically optimal. We therefore propose to revert to the usual EM algorithm to obtain MLEs, while reducing the computing cost of the E-step by using a ratio-of-normal-densities approximation. The resulting approximate EM algorithm is much easier to implement because the estimates can be updated simply by substituting the expected haplotype frequencies into the complete-data MLEs, which are just the sample proportions. This algorithm is extended to the case of Hardy-Weinberg disequilibrium (HWD) by introducing the inbreeding coefficient, and it produces approximations of the MLEs, which are known to be asymptotically optimal. Simulation studies assuming HWE show that the approximate EM algorithm leads to estimates with substantially smaller biases and standard deviations (SDs) than those from PoooL. Further simulations show that ignoring HWD induces biases in the estimates. Our extended algorithm, which incorporates the inbreeding coefficient, is able to reduce the bias, leading to estimates with substantially smaller mean square errors (MSEs).

Because only a few haplotypes may be present in a population, we develop a unified framework, named CSPOOL, that maximizes a measure of the sparsity of haplotypes subject to the linear constraints between the first two moments of the pooled genotyping data and the haplotype frequencies. This method is closely connected to the typical methodology of compressive sensing theory, which focuses on developing sophisticated decoding algorithms for under-determined linear sensing systems by assuming sparsity of the original signals.
CSPOOL can be applied directly to both pooling DNA designs and individual designs, since the latter can be regarded as a special case of the pooling design with pool size equal to one, and CSPOOL relies only on the accuracy of the moment estimates of the genotyping data. Furthermore, CSPOOL can also be adapted to the case of HWD by incorporating the inbreeding coefficient. When the sample size of the individual design is relatively small, simulation studies show that CSPOOL outperforms the state-of-the-art algorithm PHASE in both MSE and effective cumulative haplotype frequencies. When the sample size is large, CSPOOL performs as well as PHASE, but the computational complexity of CSPOOL is independent of sample size, while the computational burden of PHASE increases rapidly. For the pooling DNA design, when the sample size is large, CSPOOL performs better than PoooL and comparably to the approximate EM algorithm. When the sample size is relatively small, PoooL and the approximate EM algorithm fail because of the singularity of the estimated LD matrix, while CSPOOL still works, revealing that the pooling DNA design is much more efficient than the individual design in both experimental cost and statistical efficiency.

Assuming a logistic regression model for haplotype-disease association, we propose a retrospective likelihood-based method, NHAP-F, for haplotype-disease association analysis under a matched case-control design. NHAP-F allows for various genetic mechanisms and can also be applied to test for haplotype-environment interaction. For moderate or rare diseases, simulation studies show that NHAP-F is robust to departures from HWE because of the use of the inbreeding coefficient. The proposed method yields approximately unbiased estimators and is uniformly more powerful than the methods available in the literature.
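
As a point of reference for the "usual EM algorithm" that the abstract builds on, the following is a minimal sketch of the textbook EM update for haplotype frequencies in the simplest setting: two biallelic SNPs, individual genotypes (pool size one), and HWE. It is not PoooL, the approximate EM algorithm, or CSPOOL from the thesis; the function name `em_two_snp`, the 0/1/2 genotype coding, and the sample data are assumptions made purely for illustration.

```python
from collections import Counter

# The four possible haplotypes at two biallelic SNPs (alleles coded 0/1).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_two_snp(genotypes, max_iter=500, tol=1e-10):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2) pairs, each entry the count (0, 1 or 2)
    of allele 1 at that SNP. Assumes HWE (illustrative sketch only).
    """
    counts = Counter(genotypes)
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")

    # Every genotype except the double heterozygote (1, 1) determines
    # its pair of haplotypes uniquely, so those are counted directly.
    base = {h: 0.0 for h in HAPLOTYPES}
    for (g1, g2), c in counts.items():
        if (g1, g2) == (1, 1):
            continue
        first = (g1 // 2, g2 // 2)
        second = ((g1 + 1) // 2, (g2 + 1) // 2)
        base[first] += c
        base[second] += c
    n_dh = counts[(1, 1)]  # number of phase-ambiguous individuals

    freq = {h: 0.25 for h in HAPLOTYPES}  # uniform starting point
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries 00/11
        # rather than 01/10, given the current frequencies.
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        expected = dict(base)
        expected[(0, 0)] += n_dh * w
        expected[(1, 1)] += n_dh * w
        expected[(0, 1)] += n_dh * (1 - w)
        expected[(1, 0)] += n_dh * (1 - w)

        # M-step: complete-data MLEs are the expected sample proportions.
        new_freq = {h: expected[h] / (2 * n) for h in HAPLOTYPES}
        delta = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Hypothetical sample: 5 + 5 homozygotes and 10 double heterozygotes.
sample = [(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)] * 10
print(em_two_snp(sample))
```

In this sketch the E-step enumerates the haplotype pairs compatible with each genotype, and that enumeration is the step that grows rapidly with pool size; this is the cost the abstract says the normal-density approximation is meant to avoid.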
Keywords/Search Tags: Pooling DNA Design, Haplotype, Normality Approximation, Maximum Entropy Model, PoooL, Hardy-Weinberg Equilibrium, EM Algorithm, AEM, AES, Compressive Sensing, Sparsity, CSPOOL, Retrospective Likelihood, Association Analysis, NHAP-F