The Limiting Behavior Of Statistical Inference Problems In Molecular Phylogenetics

Posted on: 2022-02-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Huang
GTID: 1480306560990139
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the development of high-throughput sequencing technologies, it has become possible to collect large amounts of genomic data and to reconstruct phylogenetic trees from them, which is important for analyzing the evolutionary history of populations and inferring evolutionary relationships among species. Over the past few decades, the multispecies coalescent model has become an important model for inference from multi-species genomic data. This thesis therefore focuses on statistical inference problems in molecular phylogenomics and population genomics under the multispecies coalescent model, including the estimation of evolutionary parameters such as species divergence times, population sizes, and introgression intensity, as well as tree selection and species delimitation. We study the limiting behavior of bootstrap support and of posterior probabilities for phylogenetic trees.

In this thesis, phylogenetic tree reconstruction is treated as a problem of statistical model selection, and its asymptotic behavior is investigated first. When the maximum likelihood method is used to select among equally right or equally wrong models, we find that the bootstrap model support converges to a non-degenerate distribution, and we conduct corresponding statistical experiments and phylogenetic tree simulations. In phylogenetic tree selection, the posterior probability of a tree and its bootstrap support are the two most commonly used measures of confidence in the estimated phylogeny. However, biologists discovered the star tree paradox when using Bayesian methods to select phylogenetic trees. The star tree paradox refers to the following: when several phylogenetic trees are compared by their Bayesian posterior model probabilities and the alternative models are equally right or equally wrong, the posterior probability of the selected tree is always close to 100% as the amount of data tends to infinity, yet two independent and identically distributed datasets tend to select different trees, so implausible results receive high posterior probabilities. This thesis therefore investigates the limiting behavior of the other commonly used method for generating model support, the bootstrap, when the models are equally right or equally wrong, and concludes that the model support still converges to a non-degenerate distribution. This explains why, in phylogenetics, analyses of different genes or datasets from the same species often yield different optimal species trees, each with very high support. The bootstrap model support is also found to be less extreme than Bayesian posterior model probabilities, which explains why bootstrap support is usually milder than posterior probability.

The thesis then examines the equally right or equally wrong model selection problem in more depth using the m out of n bootstrap, and obtains a sufficient condition under which the m out of n bootstrap model support converges to a single point. Under this condition, the supports obtained from different independent and identically distributed datasets converge to the same point as the amount of data tends to infinity. To some extent, this can help resolve the paradox and offer insight into equally wrong or equally right models. The limiting distribution of the m out of n bootstrap model support is compared with that of the standard bootstrap model support, and the expectation of the latter is found to often equal the convergence point of the former. Corresponding statistical experiments and simulations on phylogenetic trees are performed to verify these conclusions.

Finally, applying Bayesian methods to the analysis of multispecies coalescent models, this thesis uses computer simulation to investigate how factors in genomic data influence the limiting behavior of the inferred results. Bayesian model selection can select a relatively correct phylogenetic tree when the alternative models are not equally right or equally wrong. The question of interest is therefore how the various factors in the data affect parameter estimation and model selection, including the estimation of species divergence times, population sizes, and gene introgression intensities, and the selection of species trees. This thesis examines the impact of sequence length, the number of loci, and the number of species in genomic data on the major inference problems in population genomics and phylogenomics under the multispecies coalescent model. For the vast majority of these inference problems, the number of loci has the greatest impact. The number of species has the least effect on species tree selection and the greatest effect on species delimitation. Increasing the number of loci and increasing the mutation rate have comparable effects on parameter estimates, but sequence length has a greater effect on species tree selection. These findings may help evolutionary biologists design time- and cost-efficient sampling and sequencing strategies for data collection, as well as high-precision, high-accuracy analysis protocols. They also reveal how much information genomic data contain for different statistical inference problems.

Bayesian and bootstrap methods are widely used for statistical inference in phylogenomics and population genomics. In this thesis, we investigate the limiting distribution of model support values when the models under comparison are nearly equally right as judged by the Kullback-Leibler divergence, and we use simulations to illustrate the effect of various factors in genomic data on the limiting behavior of model selection and parameter estimation for phylogenetic trees.
Keywords: bootstrap, phylogenetic tree, model selection, multispecies coalescent model, Bayesian inference
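
The contrast the abstract draws between the standard bootstrap and the m out of n bootstrap can be sketched with a toy simulation. The setting below is an assumption chosen for illustration and is not taken from the thesis: data are drawn from N(0, 1), and two fixed, equally wrong models, N(-delta, 1) and N(+delta, 1), are compared by likelihood. The function names, the choice m = sqrt(n), and all simulation sizes are likewise hypothetical; in particular, m = sqrt(n) is not claimed to be the sufficient condition derived in the thesis.

```python
import numpy as np


def bootstrap_support(x, m, n_boot, rng):
    """Share of size-m bootstrap resamples in which model 1 beats model 2.

    Toy assumption: model 1 is N(-delta, 1), model 2 is N(+delta, 1), delta > 0.
    For a resample x*, log L1 - log L2 = -2 * delta * sum(x*), so model 1 wins
    exactly when the resample mean is negative (the value of delta drops out).
    """
    # Draw n_boot resamples of size m, with replacement, from the data.
    idx = rng.integers(0, len(x), size=(n_boot, m))
    resample_means = x[idx].mean(axis=1)
    return np.mean(resample_means < 0)


def simulate_supports(n, m, n_datasets=200, n_boot=200, seed=0):
    """Bootstrap support for model 1 across independent datasets of size n."""
    rng = np.random.default_rng(seed)
    supports = np.empty(n_datasets)
    for k in range(n_datasets):
        # True distribution N(0, 1): both candidate models are equally wrong.
        x = rng.standard_normal(n)
        supports[k] = bootstrap_support(x, m, n_boot, rng)
    return supports


if __name__ == "__main__":
    n = 2000
    standard = simulate_supports(n, m=n)               # usual n out of n bootstrap
    m_out_of_n = simulate_supports(n, m=int(n ** 0.5))  # assumed m = sqrt(n)
    for name, s in [("standard", standard), ("m out of n", m_out_of_n)]:
        print(f"{name:>10}: mean support = {s.mean():.3f}, spread (sd) = {s.std():.3f}")
```

In this toy setting, the standard bootstrap support would be expected to vary widely from dataset to dataset, while the m out of n support should concentrate near 0.5. That pattern is consistent with the abstract's statement that the expectation of the standard bootstrap's limiting distribution often equals the m out of n convergence point, but a two-model normal example is far simpler than tree selection under the multispecies coalescent.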