Cluster analysis of gene expression data

Posted on:2002-09-08

Degree:Ph.D.

Type:Dissertation

University:University of Washington

Candidate:Yeung, Ka Yee

GTID:1468390011991734

Subject:Computer Science

Abstract/Summary:

The invention of DNA microarrays allows us to study the simultaneous variation of genes at a genome-wide scale. A typical gene expression data set consists of thousands or even tens of thousands of genes and a few dozen experiments. Cluster analysis is the art of finding groups in a given data set such that objects in the same group are similar to each other while objects in different groups are dissimilar. There are many applications for clustering gene expression data.

Many different clustering algorithms and analytical techniques have been applied to gene expression data. Success of various analytical methodologies in specific instances has been reported, but extensive quantitative evaluations of clustering methodologies are rare. Since different analytical approaches may produce different clustering results, there is a great need to evaluate clustering techniques in order to choose an appropriate approach. An underlying theme of this dissertation is the systematic evaluation of clustering methodologies on gene expression data. Specifically, we proposed a data-driven methodology, called the figure of merit (FOM) methodology, to compare the quality of clusters produced by heuristic-based clustering algorithms. We also showed that the model-based clustering approach, which assumes a Gaussian mixture model, produces relatively high-quality clusters. The probabilistic framework of the model-based approach allows us to infer the correct number of clusters and to compare different models. Moreover, we investigated the effectiveness of a dimension reduction technique, principal component analysis, as a pre-processing step before cluster analysis.

Our main contributions are methodologies for evaluating analytical techniques for clustering gene expression data. We employed an external validation approach, which evaluates clustering results by comparing them to external prior knowledge of the data, to assess the performance of internal validation approaches, which do not require any external knowledge of the data. In particular, we showed that our FOM methodology and the model-based approach, both of which are internal validation approaches, produce comparisons of clustering algorithms that are consistent with comparisons to external knowledge. Since external knowledge is seldom available for gene expression data, our work provides practical evaluation frameworks for assessing clustering results on gene expression data.
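
The abstract does not give the formula behind the figure of merit (FOM), so the following Python sketch is only an illustration under stated assumptions. It assumes a leave-one-experiment-out formulation: cluster the genes using all experiments except one, then measure how tightly each cluster's genes agree on the left-out experiment (root mean squared deviation from the cluster mean), and sum over all left-out experiments. The function names `kmeans` and `aggregate_fom`, the choice of k-means as the clustering algorithm, and the toy data are all hypothetical and are not taken from the dissertation.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every row to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centers[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def aggregate_fom(data, k, cluster=kmeans):
    """Assumed FOM: sum, over each left-out experiment, of the RMS deviation
    of the genes' left-out values from their cluster means. Lower is better."""
    n_genes = len(data)
    n_experiments = len(data[0])
    total = 0.0
    for e in range(n_experiments):
        # Cluster the genes without looking at experiment e.
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = cluster(reduced, k)
        squared_dev = 0.0
        for c in set(labels):
            held_out = [data[g][e] for g in range(n_genes) if labels[g] == c]
            mean = sum(held_out) / len(held_out)
            squared_dev += sum((v - mean) ** 2 for v in held_out)
        total += math.sqrt(squared_dev / n_genes)
    return total


# Toy data (made up): 6 genes (rows) x 4 experiments (columns), two obvious groups.
data = [
    [0.0, 0.1, 0.2, 0.1],
    [0.1, 0.0, 0.1, 0.2],
    [0.2, 0.1, 0.0, 0.1],
    [5.0, 5.1, 5.2, 5.1],
    [5.1, 5.0, 5.1, 5.2],
    [5.2, 5.1, 5.0, 5.1],
]

for k in (1, 2, 3):
    print(k, aggregate_fom(data, k))
```

Because the score is computed only from the expression data, any clustering function with the same `(rows, k)` signature can be passed as `cluster` and compared at the same number of clusters, which is the sense in which this kind of validation needs no external knowledge.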

Keywords/Search Tags:

Gene expression data, Clustering, Cluster analysis, External knowledge

Related items

1	The Research And Application On Gene Expression By Clustering Algorithms
2	Clustering Algorithm Based On Biological Knowledge And Its Application On Gene Expression Data
3	Gene Microarray Data Analysis Based On Clustering Algorithms
4	Study Of Gene Expression Data Analysis Based On Pattern Recognition Methods
5	The Research And Implementation On Clustering Algorithm Of Gene Expression Data
6	Research On Clustering Methods For Analyzing Overlapping Local Gene Expression Patterns
7	The Research And Application Of Particle Swarm Optimization Algorithm In Clustering Analysis Of Gene Expression Data
8	The Design And Analysis Of Clustering Algorithms On Gene Expression Data
9	Research On Clustering Algorithms In Gene Expression Data Analyzing
10	Clustering Analysis Based On The Ant System For Gene Expression Data