
The Application Of Markov Process In Population Genetics

Posted on: 2013-01-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Q Zhu
GTID: 1110330362963435
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Population genetics is the branch of genetics that studies the genetic structure of populations and the rules by which it changes. It uses probability theory and statistical methods to study gene and allele frequencies and the selection and mutation that affect them, and to study the relationship between migration, genetic drift and genetic structure, in order to understand the mechanisms of evolution. The aim of population genetics is to reveal the mechanisms underlying the genetic composition of populations and its change. Describing the genetic composition of a population in analytical form requires quantitative study of allele and genotype frequencies, which in turn makes it possible to study how genetic structure evolves across generations and to compare genetic differences among populations.

Advances in science have led to explosive growth in DNA sequence data. Population genetics has entered the genomic era, and developing models and analyzing data have become increasingly important. The main tools of population genetics are probability theory, statistical methods and algorithms. Because of its memoryless property, the Markov process is well suited to describing many features of the evolutionary process. As the best-studied stochastic process, the Markov process is a powerful tool for developing models and analyzing data in population genetics. In this dissertation, we use the Markov process as the main tool, combined with maximum likelihood and Bayesian methods, to develop models, design algorithms and analyze datasets in order to study problems in population genetics.

First, we consider the waiting time to cancer in a cell population. Cancer is well known to be the end result of somatic mutations that disrupt normal cell division. The number of such mutations that must accumulate in a cell before cancer develops depends on the type of cancer. The waiting time T_m until the appearance of m mutations in a cell is thus an important quantity in population genetics models of carcinogenesis. Such models are often difficult to analyze theoretically because of the complex interactions of mutation, drift and selection. They are also computationally expensive to simulate because of the large number of cells and the low mutation rate. We develop an efficient algorithm for simulating the waiting time T_m until m mutations under a population genetics model of cancer development. We use an exact algorithm to simulate the evolution of small cell populations and a coarse-grained τ-leaping approximation to handle large populations. We compared our hybrid simulation algorithm with the exact algorithm in small populations and with available asymptotic results for large populations. The comparison suggests that our algorithm is accurate and computationally efficient. We also develop a Moran model with variable population size, which allows the population size to change over time. We used the algorithm to study the waiting time for up to 20 mutations under this model. Our new algorithm may be useful for studying realistic models of carcinogenesis that incorporate variable mutation rates and fitness effects.

The second study addresses problems with tree length estimation in Bayesian phylogenetic analysis. Recent studies have observed that Bayesian analyses of sequence datasets using the program MrBayes sometimes generate extremely large branch lengths, with posterior credibility intervals for the tree length (the sum of branch lengths) excluding the maximum likelihood estimates. Suggested explanations for this phenomenon include the existence of multiple local peaks in the posterior, lack of convergence of the chain in the tail of the posterior, mixing problems, and mis-specified priors on branch lengths. Here, we analyze the behavior of Bayesian Markov chain Monte Carlo algorithms when the chain is in the tail of the posterior distribution and note that all these phenomena can occur. In Bayesian phylogenetics, the likelihood function approaches a constant instead of zero as the branch lengths increase to infinity. The flat tail of the likelihood can cause poor mixing and undue influence of the prior. We suggest that the main cause of the extreme branch length estimates produced in many Bayesian analyses is the poor choice of a default prior on branch lengths in current Bayesian phylogenetic programs. The default prior in MrBayes assigns independent and identical distributions to branch lengths, imposing strong (and unreasonable) assumptions about the tree length. The problem is exacerbated by the strong correlation between the branch lengths and parameters in models of variable rates among sites or among site partitions. To resolve the problem, we suggest two multivariate priors for the branch lengths (called compound Dirichlet priors) that are fairly diffuse, and we demonstrate their utility in the special case of branch length estimation on a star phylogeny. Our analysis highlights the need for careful thought in the specification of high-dimensional priors in Bayesian analyses.

Finally, we develop an isolation-with-migration model for three species to test for speciation with gene flow. In this model, migration occurs between two closely related species, while an outgroup species provides further information about gene trees and model parameters. The model is implemented in the likelihood framework for analyzing multi-locus genomic sequence alignments, with one sequence sampled from each of the three species. The prior distribution of gene tree topology and branch lengths at every locus is calculated using a Markov chain characterization of the genealogical process of coalescence and migration, which integrates over the histories of migration events analytically. The likelihood function is calculated by integrating over the branch lengths in the gene trees (coalescent times) numerically. We analyze the model to study the gene tree-species tree mismatch probability and the time to the most recent common ancestor at a locus. The model is used to construct a likelihood ratio test of speciation with gene flow. We conduct computer simulations to evaluate the likelihood ratio test and find that the test is in general conservative, with the false positive rate well below the significance level. For the test to have substantial power, hundreds of loci are needed. Application of the test to a human-chimpanzee-gorilla genomic dataset suggests gene flow around the time of speciation of the human and the chimpanzee.
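
To make the waiting-time quantity T_m from the cancer study concrete, here is a minimal Python sketch of an exact, step-by-step simulation in a Moran-type model. It is not the dissertation's hybrid algorithm. It assumes neutral mutations, a constant population size and a single per-division mutation probability, and it leaves out the τ-leaping approximation used for large populations. The function name moran_waiting_time and the parameters n_cells, m, mu and max_steps are hypothetical names chosen for illustration.

```python
import random


def moran_waiting_time(n_cells, m, mu, rng, max_steps=10_000_000):
    """Return the number of Moran steps until a cell with m mutations appears.

    Assumptions (illustrative only): neutral mutations, constant population
    size n_cells, and each newborn cell gains one extra mutation with
    probability mu. Returns None if max_steps is reached first.
    """
    # counts[k] = number of cells currently carrying exactly k mutations (k < m)
    counts = [0] * m
    counts[0] = n_cells
    classes = range(m)

    for step in range(1, max_steps + 1):
        # Birth: pick the parent uniformly among cells (neutral model),
        # i.e. pick a mutation class with probability proportional to its size.
        parent = rng.choices(classes, weights=counts)[0]
        child = parent + 1 if rng.random() < mu else parent

        # Stop as soon as the first cell with m mutations is born.
        if child >= m:
            return step

        # Death: a uniformly chosen cell is replaced by the newborn,
        # which keeps the population size constant.
        dead = rng.choices(classes, weights=counts)[0]
        counts[dead] -= 1
        counts[child] += 1

    return None


if __name__ == "__main__":
    rng = random.Random(2013)
    # Monte Carlo estimate of the mean waiting time for m = 2 mutations.
    samples = [moran_waiting_time(100, 2, 0.01, rng) for _ in range(200)]
    print(sum(samples) / len(samples))
```

Averaging the returned step count over many replicates gives a Monte Carlo estimate of the expected waiting time. Dividing by n_cells converts steps to generations under the common Moran-model convention that one generation corresponds to n_cells steps. The cost of this exact scheme grows with population size and with decreasing mutation rate, which is the difficulty the abstract's coarse-grained approximation is meant to address.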
Keywords/Search Tags: population genetics, Markov process, Bayesian phylogenetics, waiting time for cancer, isolation-with-migration