
Comparison of clustering algorithms for gene expression microarray data

Posted on: 2004-07-23
Degree: Ph.D
Type: Thesis
University: Boston University
Candidate: Liao, Hsin-I
Full Text: PDF
GTID: 2458390011455867
Subject: Biology
Abstract/Summary:
Gene expression microarrays are a revolutionary high-throughput technology with enormous promise to help geneticists understand and explore the genome. Cluster analysis was one of the first statistical techniques applied to microarray data. Microarray data can be treated as a matrix in which rows represent genes (the variables) and columns represent subjects. There are generally two types of experimental design: a single subject observed at multiple time points, and multiple subjects observed at a single time point. Here subjects may be cell lines, experiments, drug treatments, and so on. The matrix can be studied in two ways: comparing the RNA expression profiles of genes by comparing rows, and comparing the profiles of subjects by comparing columns. The latter case is familiar to statisticians, who are used to data with a small to medium number of variables and a large number of subjects. The former case is new: researchers and statisticians must deal with hundreds or even thousands of variables and, usually, a small number of subjects. The emphasis is on clustering genes with similar expression over multiple observations. In a genetic sense, if two genes are similarly expressed, we can hypothesize that they are functionally related. This thesis concerns itself with this situation.

Various clustering procedures have been developed for extracting the maximum amount of valid information from microarray data. Predating the genetic era, Cureton and D'Agostino created a clustering algorithm based on principal components and factor analysis concepts. We investigate its behavior on microarray data and compare it with procedures frequently used in microarray analysis, such as hierarchical clustering, K-means, and self-organizing maps. All algorithms are applied both to simulated data and to the genes of Saccharomyces cerevisiae (budding yeast). The clustering results are evaluated using external and internal criteria. The external criterion measures the agreement between the estimated cluster labels and the true labels, while the internal criterion is based on distance: a good clustering maximizes the distance between clusters and minimizes the distance within clusters. The Cureton and D'Agostino procedure outperforms the other procedures when judged by the internal criterion. It is, however, not as good on the external criterion. A recommendation for practical use is given that combines the Cureton and D'Agostino method with the other methods to address both the internal and external criteria.
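The distinction between external and internal criteria can be illustrated with a small sketch. The abstract does not say which statistics the thesis uses, so the Rand index (external) and the ratio of between-cluster to within-cluster sum of squares (internal) are stand-ins chosen for illustration. The toy matrix, the label vectors, and the function names are all hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, pred_labels):
    """External criterion (assumed here: Rand index).

    Fraction of gene pairs on which the two labelings agree, i.e. both
    put the pair in the same cluster or both put it in different clusters.
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def between_within_ratio(rows, labels):
    """Internal criterion (assumed here: between / within sum of squares).

    Larger values mean clusters are far apart relative to their spread.
    """
    n_cols = len(rows[0])
    grand = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    within = 0.0
    between = 0.0
    for k in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        centroid = [sum(r[c] for r in members) / len(members) for c in range(n_cols)]
        within += sum((r[c] - centroid[c]) ** 2 for r in members for c in range(n_cols))
        between += len(members) * sum((centroid[c] - grand[c]) ** 2 for c in range(n_cols))
    return between / within if within > 0 else float("inf")


# Toy matrix: 6 genes (rows) x 3 subjects (columns); the values are made up.
expression = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.1],
    [0.9, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [4.9, 5.0, 5.1],
]
true_labels = [0, 0, 0, 1, 1, 1]
good = [1, 1, 1, 0, 0, 0]  # same partition as the truth, different label names
poor = [0, 1, 0, 1, 0, 1]  # mixes low- and high-expression genes

for name, labels in [("good", good), ("poor", poor)]:
    print(name, rand_index(true_labels, labels), between_within_ratio(expression, labels))
```

The external score needs known true labels, which is why it suits simulated data, while the internal score uses only the expression matrix and the estimated labels. A method can therefore rank well on one criterion and poorly on the other, as the abstract reports for the Cureton and D'Agostino procedure.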
Keywords/Search Tags: Microarray, Clustering, Expression, Cureton and D'Agostino, External, Internal