
Overlapping codon model, phylogenetic clustering, and alternative partial expectation conditional maximization algorithm

Posted on: 2012-07-26
Degree: Ph.D
Type: Dissertation
University: Iowa State University
Candidate: Chen, Wei-Chen
GTID: 1458390011954567
Subject: Statistics
Abstract/Summary:
This dissertation makes major statistical contributions in three areas: detection of evolutionary selection in overlapping open reading frames, clustering of genetic sequence data using evolutionary models, and a novel method for speeding up EM algorithms. The first two topics are motivated by questions raised while analyzing Equine Infectious Anemia Virus (EIAV) sequence data collected from infected horses. The third topic was initially intended as an improvement to the EM algorithm underlying our proposed clustering algorithm, but is applied to time series data in this dissertation.

The overlapping codon model is an extension of codon-based models for detecting selection occurring during biological sequence evolution. Specifically, it can model selection acting on overlapping reading frames that encode different proteins. The model breaks the alignment into independently evolving blocks of codons and assumes that mutations occur parsimoniously, which reduces the computational complexity. The model can analyze multiple sequences and statistically test for evidence of constrained selection.

Phylogenetic clustering (phyloclustering) is an evolutionary, model-based approach to identifying population structure from molecular sequence data, and is especially efficient for large sequence data sets. A Continuous Time Markov Chain (CTMC) model is assumed for the mutation process, and sequences are assumed to evolve from a few unknown ancestral sequences. A finite mixture model for the process is proposed, and an EM algorithm with analytic formulas for both the E- and M-steps is established for finding maximum likelihood estimators. Individual sequences are clustered according to their maximum posterior probabilities. In simulation studies, phyloclustering outperforms existing methods such as hierarchical clustering and K-medoid methods. The R package phyclust implements phyloclustering and integrates several useful tools for simulations.

The alternative partial expectation conditional maximization algorithm (APECM) is designed to speed up EM convergence for general finite mixture models, including the one underlying phyloclustering. The algorithm breaks each cycle of the EM algorithm into multiple subcycles by partitioning the parameter space according to the mixture components. It also relies on data augmentation, which substantially speeds up computation. In a time series clustering example, the APECM is as efficient as the EM algorithm while requiring fewer iterations, ultimately converging faster in real time.
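
To make the phyloclustering idea in the abstract concrete, the following is a minimal sketch of an EM algorithm for a finite mixture of sequences that evolve from a few unknown ancestral sequences. It is not the dissertation's implementation and does not use the phyclust package. It assumes the simplest CTMC (Jukes-Cantor, JC69), a single evolutionary distance shared by all clusters, gap-free aligned sequences over A, C, G, T, and a farthest-point initialization. The function names (`phylocluster_em`, `e_step`, `farthest_point_init`) are hypothetical. Under these assumptions both steps have closed forms, and each sequence is assigned to the cluster with the largest posterior probability.

```python
import math

NUCLEOTIDES = "ACGT"

def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def farthest_point_init(seqs, k):
    """Deterministic start (an assumption of this sketch): first sequence,
    then repeatedly the sequence farthest from the centers chosen so far."""
    centers = [seqs[0]]
    while len(centers) < k:
        centers.append(max(seqs, key=lambda s: min(hamming(s, c) for c in centers)))
    return centers

def e_step(seqs, centers, eta, p):
    """Posterior cluster probabilities under JC69, where p is the chance
    that a site differs from its ancestor (each other base has p / 3)."""
    length = len(seqs[0])
    log_same, log_diff = math.log(1 - p), math.log(p / 3)
    post = []
    for s in seqs:
        logw = []
        for c, e in zip(centers, eta):
            d = hamming(s, c)
            logw.append(math.log(e) + (length - d) * log_same + d * log_diff)
        m = max(logw)  # subtract the max before exponentiating for stability
        w = [math.exp(v - m) for v in logw]
        total = sum(w)
        post.append([v / total for v in w])
    return post

def phylocluster_em(seqs, k, n_iter=50):
    n, length = len(seqs), len(seqs[0])
    centers = farthest_point_init(seqs, k)
    eta, p = [1.0 / k] * k, 0.1
    for _ in range(n_iter):
        post = e_step(seqs, centers, eta, p)
        # M-step, mixing proportions (floored so log(eta) stays finite).
        eta = [max(sum(row[j] for row in post) / n, 1e-12) for j in range(k)]
        # M-step, ancestral sequences: posterior-weighted majority base per site.
        centers = []
        for j in range(k):
            center = []
            for site in range(length):
                weight = dict.fromkeys(NUCLEOTIDES, 0.0)
                for s, row in zip(seqs, post):
                    weight[s[site]] += row[j]
                center.append(max(NUCLEOTIDES, key=weight.get))
            centers.append("".join(center))
        # M-step, substitution probability: expected fraction of mismatched sites.
        mismatches = sum(row[j] * hamming(s, centers[j])
                         for s, row in zip(seqs, post) for j in range(k))
        p = min(max(mismatches / (n * length), 1e-6), 0.74)
    post = e_step(seqs, centers, eta, p)
    labels = [max(range(k), key=row.__getitem__) for row in post]
    t = -0.75 * math.log(1 - 4 * p / 3)  # JC69 distance in substitutions per site
    return labels, centers, eta, t
```

A call such as `labels, centers, eta, t = phylocluster_em(seqs, 2)` returns the cluster label of each sequence, the estimated ancestral sequences, the mixing proportions, and the estimated evolutionary distance. Richer substitution models or cluster-specific distances would change the M-step formulas.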
Keywords/Search Tags: Clustering, Algorithm, Model, Overlapping, Sequence data, Time, Selection