New clustering and feature selection procedures with applications to gene microarray data

Posted on: 2009-03-31
Degree: Ph.D
Type: Thesis
University: Case Western Reserve University
Candidate: Xu, Yaomin
Full Text: PDF
GTID: 2448390005460497
Subject: Biology
Abstract/Summary:
Statistical data mining is one of the most active research areas. In this thesis we develop two new data mining procedures and explore their applications to genetic data.

The first procedure is called PfCluster (Profile Cluster Analysis). It is a clustering method designed for profiled genetic data. PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal quality measure for clusters, the coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold for coherent clusters is also derived and implemented. The threshold is based on first- and second-order approximations to the true threshold under a null distribution for parallel clusters.

PfCluster has been applied to simulated data and to two real data examples: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with correlation-based clustering procedures.

The second procedure is called RPselection (Resampling-based Partitioning selection). It is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the relevance between the partition labels from a clustering result and an external class label derived from the clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and to a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene test-based feature selection procedures.

Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis based on our PfCluster method and RPselection algorithm. Both packages are implemented in the R object-oriented programming framework and can be easily customized and extended by users.

The ideas in our two procedures can be generalized and applied to other data mining tasks. This thesis concludes with a discussion of the connections between the two methods and related future research.

Key words: Bioinformatics, coherence index, data mining, feature selection, gene expression pathway, gene profiling, informative gene, microarray data, profile cluster analysis, partitioning, regulatory network, statistical pattern recognition.
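The abstract does not give the exact form of the RPselection fitness score, so the following is only a minimal sketch of the general idea it describes: resample the samples, cluster them using a candidate gene subset, and measure how well the resulting partition agrees with an external class label. Everything concrete here is an assumption made for illustration and is not taken from the thesis or from the rpselect package: the function names (`rand_index`, `two_means`, `fitness_score`), the use of bootstrap resampling, the toy 2-means clustering, the Rand index as the agreement measure, and the toy data.

```python
import random
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two labelings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def two_means(rows, n_iter=20):
    """Tiny 2-means clustering; a stand-in for any partitioning method."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Deterministic start: the first row and the row farthest from it.
    centers = [rows[0], max(rows, key=lambda r: dist(r, rows[0]))]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [0 if dist(r, centers[0]) <= dist(r, centers[1]) else 1
                  for r in rows]
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fitness_score(data, class_labels, genes, n_resamples=50, seed=0):
    """Mean Rand index between cluster partitions and class labels,
    averaged over bootstrap resamples of the samples (hypothetical score)."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        rows = [[data[i][g] for g in genes] for i in idx]   # keep chosen genes
        partition = two_means(rows)
        scores.append(rand_index(partition, [class_labels[i] for i in idx]))
    return sum(scores) / len(scores)

# Toy data: rows are samples, columns are genes.
# Gene 0 tracks the class label; gene 1 is unrelated to it.
data = [
    [0.1, 5.0], [0.2, 1.0], [0.0, 4.0], [0.3, 2.0],
    [5.1, 1.5], [4.9, 4.5], [5.2, 2.5], [5.0, 3.5],
]
classes = [0, 0, 0, 0, 1, 1, 1, 1]

score_informative = fitness_score(data, classes, genes=[0])
score_noise = fitness_score(data, classes, genes=[1])
print(score_informative, score_noise)
```

In this toy setting the subset containing the class-related gene scores higher than the subset containing only the unrelated gene, which is the behavior a subset search maximizing such a score would exploit. How RPselection actually searches over gene subsets is not specified in the abstract and is not modeled here.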
Keywords/Search Tags:Data, Feature selection, Gene, New, Microarray, Procedures, Clustering