Clustering time-course gene-expression array data

Posted on:2009-09-11

Degree:Ph.D

Type:Thesis

University:Rice University

Candidate:Gershman, Jason Andrew

Subject:Statistics

Abstract/Summary:

This thesis examines methods for clustering time-course gene expression array data. Over the past decade, various model-based methods have been published and advocated for clustering this type of data in place of classic non-parametric techniques such as K-means and hierarchical clustering. On simulated data, where the variance between clusters is large, I show that the model-based MCLUST outperforms both the model-based SSClust and the non-model-based K-means clustering. I also show that neither the number of genes nor the number of clusters has a significant effect on the performance of these model-based clustering techniques. On two real data sets, where the variance between clusters is smaller, I show that the model-based SSClust outperforms both MCLUST and K-means clustering. Since the "truth" is often not known for real data sets, I treat the clustering of the original data as "truth", then perturb the data by adding pointwise noise and cluster the noisy data. Throughout my analysis of real and simulated expression data, I use the misclassification rate and the overall success rate to measure the success of each clustering algorithm. Overall, the model-based methods appear to cluster the data better than the non-model-based methods.

Later, I examine the role of gene ontology (GO) and the use of gene ontology data in clustering gene expression data. I find that clustering on a synthesis of gene expression and gene ontology data not only yields clusters with biological meaning but also clusters the data well. I also introduce an algorithm for clustering expression profiles on both gene expression and gene ontology data when some of the genes are missing ontology data. Rather than ignoring the missing data or lumping it all into a miscellaneous cluster, as some other methods do, I use classification and inferential techniques to cluster using all of the available data, and this method shows promising results. I also examine which of the three ontologies (molecular function, biological process, and cellular component) is best for clustering expression data. This analysis shows that biological process is the preferred ontology for clustering expression data.
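
The perturbation procedure described in the abstract (cluster the data, treat that clustering as "truth", add pointwise noise, re-cluster, and score the result) can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: a plain K-means stands in for the clustering method (MCLUST and SSClust are not reproduced), the noise is assumed Gaussian, the toy profiles are invented, and the misclassification rate is computed under the best relabeling of clusters, which may differ from the definition used in the thesis. All function names are hypothetical.

```python
import itertools
import random


def kmeans(profiles, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per profile."""
    rng = random.Random(seed)
    centers = rng.sample(profiles, k)  # start from k distinct profiles
    labels = [0] * len(profiles)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = []
        for p in profiles:
            dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            labels.append(dists.index(min(dists)))
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def misclassification_rate(truth, predicted, k):
    """Fraction of items mislabeled under the best relabeling of clusters.

    Cluster labels are arbitrary, so every permutation of labels is tried
    (assumption: k is small enough for brute force).
    """
    best = 0
    for perm in itertools.permutations(range(k)):
        agree = sum(1 for t, p in zip(truth, predicted) if perm[p] == t)
        best = max(best, agree)
    return 1 - best / len(truth)


def perturb(profiles, sd, seed=1):
    """Add independent Gaussian noise (assumed noise model) to every time point."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0, sd) for x in p] for p in profiles]


# Invented toy data: two well-separated groups of four-point time courses.
group_a = [[0.1 * i, 0.2, 0.1 * i, 0.0] for i in range(5)]
group_b = [[10 + 0.1 * i, 10.2, 10 + 0.1 * i, 10.0] for i in range(5)]
profiles = group_a + group_b

truth = kmeans(profiles, k=2)             # clustering of clean data as "truth"
noisy = perturb(profiles, sd=0.5)         # pointwise noise
noisy_labels = kmeans(noisy, k=2)         # re-cluster the noisy data
rate = misclassification_rate(truth, noisy_labels, k=2)
print("misclassification rate:", rate)
```

Repeating the last four steps over a range of noise levels (`sd`) would give a rough picture of how quickly a method's clustering degrades as the data become noisier.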

Keywords/Search Tags:

Expression, Cluster, Array data, Ontology, Biological process, Methods, Real data sets
