Microarray analysis: Choice of metric, new clustering algorithm and identification of transcription factors

Posted on: 2006-08-13
Degree: Ph.D
Type: Dissertation
University: Harvard University
Candidate: Kim, Ryung Suk
Full Text: PDF
GTID: 1458390008471118
Subject: Biology
Abstract/Summary:
Statistical algorithms that combine microarray expression data with genome sequence data have successfully identified transcription factor (TF) binding motifs in the genomes of lower eukaryotes. In higher eukaryotes, however, finding TF binding sites remains a challenge. Gene expression clusters found by classical methods often do not lead to successful identification of TF binding sites. We attribute the current lack of success to three difficulties.

The first difficulty is locating the relevant motifs in the promoter regions. In higher eukaryotes, TF binding sites, which often work in combination, can appear far upstream (e.g., 20,000 bases upstream of the transcription start site), in introns, and even in downstream regions. However, more advanced methods for cis-regulatory analysis, e.g., using cross-species comparison, are being developed. The second difficulty is the low specificity of the co-expressed genes identified by microarray analysis. We observe that few of the identified co-expressed genes are, in fact, co-regulated. Part of the reason is the poor performance of the metrics between expression profiles and of the clustering algorithms that find co-expressed genes in an automated way. The third difficulty is combining microarray analysis with cis-regulatory analysis. Regression approaches are less likely to work because the number of regulated genes is small compared to the whole genome. Other approaches, which search directly for motifs in the promoter regions of co-expressed genes, are not promising because we observe that tightly co-expressed genes often share few known TF binding motifs in their promoter regions.

In chapter one, we propose a new metric between mRNA expression profiles that correlates better with regulatory distance than widely used metrics such as correlation or cosine correlation. In chapter two, we propose a clustering algorithm that uses repeated sub-sampling to distinguish candidate clusters from scattered genes and also requires each cluster to maintain quality in the original feature distances. The high specificity of the clusters is validated through simulation studies. In chapter three, we apply the new metric and clustering algorithm to microarray data and propose a new approach for combining the results with cis-regulatory analysis to identify relevant transcription factors.
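
For orientation, the two widely used baseline metrics named above (correlation and cosine correlation) can be sketched as distances between expression profiles. This is only an illustrative sketch of the baselines, not the new metric proposed in chapter one, which the abstract does not specify. The function names and the example profiles gene_a and gene_b are hypothetical.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation: profiles are mean-centered first."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return 1.0 - dot / norm

def cosine_distance(x, y):
    """1 - cosine of the angle between profiles (uncentered correlation)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

# Hypothetical expression profiles over four conditions:
# gene_b has the same shape as gene_a but a shifted baseline.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]

d_pearson = pearson_distance(gene_a, gene_b)  # ignores the baseline shift
d_cosine = cosine_distance(gene_a, gene_b)    # sensitive to the baseline shift
```

The two baselines can disagree on the same pair of genes: centering makes the correlation distance blind to a constant offset, while the cosine distance is not. This illustrates why the choice of metric can change which genes appear co-expressed.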
Keywords/Search Tags:Microarray, Transcription, New, Clustering algorithm, TF binding, Cis-regulatory analysis, Data, Combine