
Probabilistic techniques for biological data analysis

Posted on: 2006-09-24
Degree: Ph.D
Type: Dissertation
University: University of Delaware
Candidate: Zeng, Yujing
Full Text: PDF
GTID: 1458390008453649
Subject: Engineering
Abstract/Summary:
This dissertation explores the application of probabilistic techniques to several data analysis problems related to biological systems, focusing on how best to incorporate diverse sources of information into the final result. Our efforts have concentrated on two important biological problems: gene expression data analysis and microbial gene identification.

In the work on gene expression data analysis, we developed two novel clustering techniques: the profile-HMM clustering algorithm and the Meta-Clustering algorithm. The first algorithm is designed for a special case, the clustering analysis of gene expression time-course data. Its core technique is a novel hidden Markov model (HMM) designed to take explicit account of the dynamic nature of temporal gene expression profiles in the clustering process. We then extend our study to a more general case, which focuses on integrating various clustering results from a single dataset. In the framework of the Meta-Clustering algorithm, probabilistic techniques are used to implicitly weight each input clustering structure according to how well it reflects the underlying structure of the original data. This extracted information is then incorporated into a single hierarchical clustering result. Simulations with artificial and real data show the promising performance of both algorithms.

The other problem considered in this dissertation is gene identification in microbial genomes. Several features in genome sequences show special patterns for protein coding regions, and it is necessary to incorporate "all" the existing evidence to refine microbial gene identification. The starting point of our work in this area is the study of various important features of the gene structure. A novel framework is then proposed to integrate various sources of evidence for automatic gene identification on microbial genomes. The proposed framework, EvidenceN, uses a "generalized" probability theory, Dempster-Shafer theory (DST), to integrate multiple evidence sources, and uses a novel evidence network structure to incorporate information from across the whole genome sequence into the gene finding process. The proposed integration methods have been tested on real microbial genomes, and the results show an improvement.
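To illustrate how Dempster-Shafer theory can fuse evidence sources, the following is a minimal Python sketch of Dempster's rule of combination for two sources. It is not the EvidenceN implementation, and it does not model the evidence network structure. The function name `combine`, the two example sources (`codon_usage`, `homology`), and all mass values are hypothetical and chosen only for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass,
    with the masses summing to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing mass goes to the intersection of the two focal sets.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint focal sets contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    # Renormalize so the combined masses sum to 1.
    return {h: w / norm for h, w in combined.items()}


# Hypothetical frame of discernment for one candidate region.
GENE = frozenset({"coding"})
NON = frozenset({"noncoding"})
EITHER = GENE | NON  # mass here expresses ignorance

# Hypothetical mass assignments from two evidence sources.
codon_usage = {GENE: 0.6, EITHER: 0.4}
homology = {GENE: 0.7, NON: 0.1, EITHER: 0.2}

fused = combine(codon_usage, homology)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Unlike a single probability distribution, each source may assign mass to the whole set `EITHER`, which lets it express ignorance rather than forcing a split between "coding" and "noncoding". This is the sense in which DST generalizes probability theory.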
Keywords/Search Tags:Data analysis, Probabilistic techniques, Biological, Microbial genomes, Gene