
Information fusion of multiple genomic sensors for clustering and cis-regulatory element identification

Posted on: 2007-03-03
Degree: Ph.D.
Type: Dissertation
University: The Pennsylvania State University
Candidate: Kasturi, Jyotsna
Full Text: PDF
GTID: 1458390005983239
Subject: Biology
Abstract/Summary:
The use of computational techniques for analyzing genomic data has seen rapid growth in recent years, especially with the advent of high-throughput technologies and availability of genome-wide DNA sequences. Co-regulated genes are often involved in similar cellular and biological processes, controlled by common regulators. Genes with similar patterns of expression often exhibit similar regulatory behavior. Further, control sequences corresponding to common regulators may be identified within non-coding DNA sequences of co-expressed genes.

Clustering techniques may be used to identify cohorts of genes with similar expression patterns. The results of a clustering algorithm and the quality of the clusters are largely dependent on the choice of distance measure used to calculate similarity. A novel clustering algorithm, which uses Kullback-Leibler (KL) Divergence to estimate gene similarity, is presented. The KL Clustering algorithm has been applied successfully to Heart Rate Variability data. Due to systematic and experimental variations, gene expression measurements are often noisy. Individual expression profiles are modeled as Gaussian Radial Basis Functions (GRBF) to address this problem. A new approximation method to evaluate KL divergence for GRBFs is introduced. Microarray data alone are limited in their power to identify co-regulated genes. A Combined Clustering algorithm that is capable of incorporating diverse sources of information simultaneously is presented.

Transcriptional regulation is mediated by the interaction between transcription factors and their DNA binding sites, represented by short sequences usually present near the promoter regions of genes. Co-regulated genes often have one or more regulators in common. A search for common sequence patterns within DNA sequences of clustered genes can be used to identify transcription factor binding sites (regulatory elements or motifs).
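
As an illustration of the KL-divergence-based similarity described in the clustering paragraph above, the following is a minimal sketch. It assumes each expression measurement is summarized by a single univariate Gaussian (mean and variance) and uses the standard closed-form KL divergence between two Gaussians, symmetrized and summed over conditions. This is not the GRBF approximation introduced in the dissertation; the function names and the (mean, variance) profile representation are hypothetical.

```python
import math


def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(P || Q) for univariate Gaussians P and Q."""
    return 0.5 * (
        math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / var_q
        - 1.0
    )


def symmetric_kl(p, q):
    """Symmetrized divergence KL(P || Q) + KL(Q || P); p and q are (mean, variance)."""
    return kl_gaussian(*p, *q) + kl_gaussian(*q, *p)


def profile_distance(profile_a, profile_b):
    """Sum the symmetrized KL over conditions.

    Each profile is a list of (mean, variance) pairs, one per condition
    (an assumed representation, chosen only for this sketch).
    """
    return sum(symmetric_kl(a, b) for a, b in zip(profile_a, profile_b))


# Toy profiles invented for illustration: (mean, variance) per condition.
gene_x = [(0.0, 1.0), (1.0, 1.0)]
gene_y = [(0.0, 1.0), (2.0, 1.0)]
distance = profile_distance(gene_x, gene_y)
```

A distance of this form could be plugged into any clustering routine that accepts a pairwise dissimilarity; unlike Euclidean distance on the means alone, it takes the variance of each measurement into account.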
A novel method to identify regulatory elements that discriminate between prespecified gene clusters is presented. The algorithm, based on the Naïve Bayes technique, uses a string-based model to represent motifs. Since the motifs are discriminative, the need for background distributions is completely eliminated. The method is capable of integrating diverse data sources such as gene expression data, sequence data and phylogenetic information (e.g., sequence conservation across species). An evaluation of the identified motifs on mouse genes indicates that comparative genomics significantly improves the quality of the predictions. A new interactive motif visualization tool, MotifTreeViz, is presented.

Preliminary results on several real data sets indicate that this suite of algorithms produces results that are biologically significant. All the algorithms are designed to be scalable. The software is made publicly available via a web user interface at http://biogeowarehouse.cse.psu.edu. Additionally, the individual programs may be downloaded at http://www.cse.psu.edu/~jkasturi/Software.htm.
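
To make the idea of discriminative, string-based motifs more concrete, here is a minimal sketch under stated assumptions. It treats every k-mer as a candidate motif and ranks k-mers by the log-odds of their presence in one gene cluster versus another, which is the per-feature term a Bernoulli Naive Bayes classifier uses for features that are present. Because the second cluster serves as the contrast, no separate background model is needed. The function names, the choice of k, the pseudocount, and the toy sequences are all hypothetical and do not reproduce the dissertation's algorithm.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_log_odds(cluster_a, cluster_b, k=5, pseudo=1.0):
    """Score each k-mer by log P(present | A) / P(present | B).

    Presence is counted once per sequence; probabilities are smoothed
    with a pseudocount so unseen k-mers do not yield log(0).
    """
    count_a = Counter(m for s in cluster_a for m in kmers(s, k))
    count_b = Counter(m for s in cluster_b for m in kmers(s, k))
    scores = {}
    for m in set(count_a) | set(count_b):
        p_a = (count_a[m] + pseudo) / (len(cluster_a) + 2 * pseudo)
        p_b = (count_b[m] + pseudo) / (len(cluster_b) + 2 * pseudo)
        scores[m] = math.log(p_a / p_b)
    return scores


# Toy promoter fragments invented for illustration only.
cluster_a = ["ACGTGACC", "TTACGTGA", "GACGTGTT"]
cluster_b = ["TTTTAAAA", "CCCCGGGG", "ATATATAT"]

scores = kmer_log_odds(cluster_a, cluster_b, k=5)
best = max(scores, key=scores.get)  # k-mer most specific to cluster A
```

In this toy input the 5-mer shared by all three sequences of the first cluster and absent from the second receives the highest score. Additional evidence such as cross-species conservation could, in principle, enter as further features or weights, but that extension is not shown here.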
Keywords/Search Tags:Clustering, Data, DNA sequences, Regulatory, Information, Genes