
Enrichment constrained time dependent clustering analysis of time series microarray data

Posted on: 2009-11-03
Degree: M.S.
Type: Thesis
University: The University of Texas at San Antonio
Candidate: Meng, Jia
Full Text: PDF
GTID: 2448390002995105
Subject: Engineering
Abstract/Summary:
DNA microarray experiments simultaneously monitor the expression profiles of thousands of genes. This technology has produced a large amount of publicly available genome-wide expression data, providing opportunities to gain a system-level understanding of gene functions and biological processes. The challenge is how to apply computational methods, including clustering, to extract useful information from these data.

This thesis investigates a new clustering algorithm for time series microarray data analysis. Clusters of gene expression are considered to be manifestations of transcriptional modules (TMs). Several clustering algorithms [1-3] have been applied to uncover TMs; however, they suffer from the following limitations. First, many algorithms, including K-means and the signature algorithm, depend on ad hoc parameters and therefore produce clusters that are not optimal. Even when an algorithm is designed to be optimal, its result can only be said to be optimal in the mathematical sense and is not necessarily biologically meaningful, which is the goal of clustering analysis. Second, existing algorithms are all designed to uncover time-static transcription modules under a specific experimental condition, and thus fail to capture changes of cell state or to work on a single time series. In this work, we seek to overcome these two limitations.

First, rather than assuming time-static TMs, a more realistic scenario is considered in which a module is defined over a specific period of time, i.e., a time-varying transcription module (TVTM). To develop an algorithm for TVTM discovery, a rigorous mathematical definition of a TVTM is provided, which specifies the information to be extracted from time series expression data.
This definition also serves as an objective function, on which an effective time dependent iterative signature algorithm (TDISA) is developed. TDISA iteratively refines the module contents and time periods within a time window; in this way, the time dependency between time-adjacent samples is incorporated to stabilize the result and guarantee the continuity of the modules identified.

Second, in order to identify time-varying modules that are biologically meaningful rather than merely mathematically optimal, we developed an enrichment-constrained time dependent clustering algorithm (ETA), through which the biological significance of clustering results can be tested. Once the biological significance of a module can be tested against existing knowledge, all the parameters can be optimized in terms of biological significance, so as to identify the modules that are most biologically meaningful. Moreover, since false modules tend to be eliminated because of their inconsistency with known biological facts, the reliability and accuracy of the algorithm should also improve.

Simulation results show that, compared with K-means clustering, TDISA can identify more time-varying transcription modules with better accuracy, even when the gene annotation is incomplete and/or contains errors.

ETA was applied to a KSHV human infection dataset. It identified 48 modules that have different biological meanings (different gene categories enriched) and/or show different trends over time, many of which match known biological facts well.

The contributions of this work are twofold. First, a time dependent iterative signature algorithm (TDISA) is developed to retrieve time-varying transcription modules (TVTMs) that are rigorously defined.
Second, an enrichment-constrained framework based on existing knowledge is proposed to optimize the clustering result in terms of biological significance (in contrast to most existing methods, which search for time-static transcription modules optimized in purely mathematical terms).
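
As an illustration of how the biological significance of a module might be tested, the sketch below computes an enrichment p-value for one gene category in one module. The abstract does not state which statistical test ETA uses, so the hypergeometric test (a common choice for gene-category enrichment), the function name `enrichment_p_value`, and its parameters are all assumptions made for this example and should not be read as the method of the thesis.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       module_size: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    Hypothetical helper, not taken from the thesis.
      population  -- number of genes on the array
      annotated   -- genes in the population that belong to the category
      module_size -- number of genes in the candidate module
      overlap     -- genes in the module that belong to the category
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must lie in [0, population]")
    if not 0 <= module_size <= population:
        raise ValueError("module_size must lie in [0, population]")

    # The overlap cannot exceed either the category or the module size.
    max_overlap = min(annotated, module_size)
    start = max(overlap, 0)

    # Count the ways to draw a module with k category genes, for every
    # k at least as large as the observed overlap.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, module_size - k)
        for k in range(start, max_overlap + 1)
    )
    # Divide by the number of ways to draw any module of this size.
    return tail / comb(population, module_size)
```

For example, under these assumptions, a module of 5 genes drawn from an array of 20 genes, 4 of which are annotated to a category and 2 of which fall in the module, gives a p-value of 3856/15504 (about 0.25), which would not indicate enrichment at conventional significance thresholds. A small p-value would suggest that the category is over-represented in the module relative to chance.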
Keywords/Search Tags:Time, Clustering, Modules, Microarray, Biological, Data, Existing, Gene