Statistical pattern recognition in genomic DNA sequences

Posted on: 2003-07-20
Degree: Ph.D
Type: Thesis
University: The University of Manitoba (Canada)
Candidate: Cheung, (Leo) Wang-Kit
Full Text: PDF
GTID: 2468390011988019
Subject: Statistics
Abstract/Summary:
This thesis is concerned with probabilistic and statistical approaches for pattern recognition in genomic DNA sequences. Building probabilistic models with a hidden Markov model (HMM) structure and investigating runs-related statistics are two distinct mathematical/statistical/computational research topics that have been widely utilized in the area of bioinformatics. This work coalesces both topics and provides ideas on further broadening them.

The use of the finite Markov chain imbedding (FMCI) technique to study DNA patterns under an HMM is introduced. With a vision of studying multiple runs-related statistics simultaneously under an HMM through the FMCI technique, this work establishes an investigation of a double runs statistic under a binary HMM for DNA pattern recognition. An FMCI-based recursive algorithm is derived and implemented for the determination of the exact distribution of this double runs statistic under an independent identically distributed (IID) framework, a Markov chain (MC) framework, and a binary HMM framework. With this algorithm, a conditional runs test is revised and used to test for randomness against clustering of signals in DNA. Having studied the distributions of the double runs statistic under different binary HMM parameter sets, probabilistic profiles of runs are created and shown to be useful for trapping HMM maximum likelihood estimates (MLEs). This MLE-trapping scheme offers good initial estimates which not only jump-start the Expectation-Maximization (EM) algorithm, but also prevent the EM estimates from landing on a local maximum. Based on parametric bootstrapping with the MLE-trapping scheme, simple methods are used and implemented to construct confidence intervals for the HMM parameters. Applications of the conditional runs statistic, the double runs statistic, and the probabilistic profiles in conjunction with binary HMMs for DNA pattern recognition are demonstrated using human DNA data.

A multivariate class of probabilistic models, Hidden Multivariate Markov Models (HM3s), is also introduced for modelling DNA sequences. As biochemical and biophysical evidence indicates that DNA molecules possess many different aspects beyond their compositional content, creating probabilistic models from a multivariate perspective makes natural biological sense for the analysis of DNA. A bivariate version of the HM3 is developed for exploration of the joint behaviour of the C+G richness pattern and the bendability pattern of DNA.
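
To make the FMCI idea mentioned in the abstract more concrete, the following is a minimal Python sketch. It is not the thesis's algorithm: the abstract does not define the double runs statistic, so the sketch uses a simpler, assumed statistic, the number of runs of 1s in a binary sequence, under a first-order Markov chain (the IID case is the special case where both rows of the transition matrix are equal). The function name `runs_distribution` and the parameters `init` and `trans` are hypothetical names chosen for illustration.

```python
def runs_distribution(n, init, trans):
    """Exact distribution of the number of runs of 1s in a binary
    first-order Markov chain of length n (n >= 1).

    init[s]     = P(X_1 = s)
    trans[a][b] = P(X_{t+1} = b | X_t = a)

    Illustrative assumption: this simple statistic stands in for the
    double runs statistic, which the abstract does not define.
    """
    # Imbedded chain state: (runs of 1s counted so far, last symbol seen).
    dist = {(0, 0): init[0], (1, 1): init[1]}
    for _ in range(n - 1):
        nxt = {}
        for (runs, last), prob in dist.items():
            for sym in (0, 1):
                # A new run of 1s starts only on a 0 -> 1 transition.
                new_runs = runs + 1 if (sym == 1 and last == 0) else runs
                key = (new_runs, sym)
                nxt[key] = nxt.get(key, 0.0) + prob * trans[last][sym]
        dist = nxt
    # Marginalise over the last symbol; at most ceil(n / 2) runs are possible.
    out = [0.0] * ((n + 1) // 2 + 1)
    for (runs, _), prob in dist.items():
        out[runs] += prob
    return out


# IID fair-coin case with n = 3: index r holds P(number of runs of 1s = r).
iid = [[0.5, 0.5], [0.5, 0.5]]
print(runs_distribution(3, [0.5, 0.5], iid))  # [0.125, 0.75, 0.125]
```

The recursion visits a number of states that grows linearly with the sequence length rather than enumerating all 2^n sequences, which is what makes an exact distribution computable for long sequences. Extending such a sketch to an HMM would presumably require the state to carry the hidden state as well; that extension is not shown here.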
Keywords/Search Tags: DNA, Pattern, Statistic, HMM, Probabilistic models