| Technological advances are increasing both the volume and the kinds of biological data being generated. These data sets hold great promise for advances in biology and medicine. Because of their size, though, manual analysis is often impractical, and novel computational approaches are needed. This thesis investigates the use of machine learning methods for discovering an important class of DNA sequences, known as regulatory elements, that are encoded in the genomes of bacteria. One set of contributions of this thesis relates to computational biology. We develop probabilistic models of three types of regulatory elements (promoters, terminators and operons). Key properties of our approach are that it combines heterogeneous evidence sources, predicts all three types of regulatory elements in a single model, and predicts regulatory elements in a set of bacterial genomes simultaneously. We present experiments showing that our promoter, terminator and operon predictions all exceed the previous state of the art in accuracy. Another set of contributions relates to machine learning. Two of these contributions are novel methods for learning the parameters and structure of a probabilistic grammar. Our empirical evaluation shows that both approaches lead to improved accuracy on a terminator prediction task. Another machine learning contribution of this thesis is a semi-supervised approach to learning from "weakly-labeled" training examples. We show how to acquire and use weakly-labeled examples by exploiting relationships among concepts. Our empirical evaluation shows that these examples can increase accuracy for some training set sizes. A final machine learning contribution of this thesis is a probabilistic framework for representing and predicting overlapping elements in sequence data. Unlike hidden Markov models, which assign labels to individual positions of a sequence, our approach assigns labels to whole subsequences. 
Experiments designed to test the accuracy of our method show that our approach is more accurate than two alternatives. Although each of these machine learning contributions is motivated by properties of the regulatory element discovery problem, all of them are general and apply to other domains as well. |