Font Size: a A A

Stochastic models for sequence pattern discovery

Posted on:2004-11-29Degree:Ph.DType:Dissertation
University:Harvard UniversityCandidate:Gupta, MayetriFull Text:PDF
GTID:1460390011977042Subject:Statistics
Abstract/Summary:
We focus on the detection of short, recurring, stochastically varying sequence patterns (motifs) in biological sequences. DNA sequence motifs represent binding sites for certain proteins in the process of gene regulation. Current statistical approaches to pattern detection are based on either (a) word enumeration techniques, or (b) positional weight matrix (PWM) scoring. The first approach has difficulties in finding fuzzy (weakly conserved) patterns; while the second often results in high false positive rates. Both have limited flexibility under scenarios such as multiple pattern types, unknown widths, insertions and deletions (gaps), and pattern clustering.; Combining ideas from both these approaches, we propose a novel statistical framework for pattern detection, based on the idea of a stochastic "dictionary". Here, conserved patterns and individual nucleotides are treated as stochastic words generated according to PWMs, and observed sequences generated by concatenations of these words. Two models are developed: (1) independently generated patterns and (2) pattern clusters (modules) of varying composition, with an underlying Markov structure. Employing a missing data approach, we propose a novel data augmentation strategy, which allows us to handle unknown pattern widths, multiple pattern types, gaps, and low-complexity regions. The pattern cluster discovery question is formulated as a state-space estimation problem, and approached through evolutionary Monte Carlo-based techniques (Liang and Wong, 2000). The flexibility of these models is accompanied by a high degree of computational complexity---which are reduced by means of dynamic programming-like recursive techniques.; Determining the level of significance of detected motifs can be regarded as a model selection problem. Calculation of the Bayes factor being analytically intractable, and computationally infeasible, we instead propose a Maximal A Posteriori scoring criterion and deduce a set of necessary and sufficient conditions for its consistency. Finally, we assess the effect of prior formulation, demonstrating the relative robustness of a mixture Dirichlet prior compared to the usual conjugate Dirichlet. With little prior information, some guidance is provided towards parameter choice to ensure robustness. The methodology presented here is fairly general and may be modified to apply to pattern-finding scenarios in fields ranging from signal processing to linguistics.
Keywords/Search Tags:Pattern, Stochastic, Sequence, Models
Related items