New clustering and feature selection procedures with applications to gene microarray data

Posted on:2009-03-31

Degree:Ph.D

Type:Thesis

University:Case Western Reserve University

Candidate:Xu, Yaomin

GTID:2448390005460497

Subject:Biology

Abstract/Summary:

Statistical data mining is one of the most active research areas. In this thesis we develop two new data mining procedures and explore their applications to genetic data.

The first procedure is called PfCluster (Profile Cluster Analysis). It is a clustering method designed for profiled genetic data. PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal quality measure for clusters, the coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold for coherent clusters is also derived and implemented. The threshold is based on first- and second-order approximations to the true threshold under a null distribution for parallel clusters.

PfCluster has been applied to simulated data and to two real data examples: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with correlation-based clustering procedures.

The second procedure is called RPselection (Resampling-based Partitioning Selection). It is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the relevance between the partition labels from a clustering result and an external class label derived from the clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and to a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene test-based feature selection procedures.

Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis based on our PfCluster method and RPselection algorithm. Both packages are implemented in the R object-oriented programming framework, and they can be easily customized and extended by users.

The ideas in our two procedures can be generalized and applied to other data mining tasks. This thesis concludes with a discussion of the connections between the two methods and of related future research.

Key words: Bioinformatics, coherence index, data mining, feature selection, gene expression pathway, gene profiling, informative gene, microarray data, profile cluster analysis, partitioning, regulatory network, statistical pattern recognition.
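
The abstract does not define the RPselection fitness score, so the following is only a minimal Python sketch of the general idea: score a candidate gene subset by how well a clustering of the samples agrees with an external class label, averaged over resamples. Every specific choice here is an assumption for illustration and is not taken from the thesis or from the rpselect package: the two-cluster k-means, the adjusted Rand index as the agreement measure, subsampling without replacement as the resampling scheme, and the hypothetical function names `two_means`, `adjusted_rand_index`, and `fitness`.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Agreement between two labelings; 1.0 means identical partitions."""
    n = len(a)
    if n < 2:
        return 0.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: a single cluster on both sides
        return 0.0
    return (sum_cells - expected) / (max_index - expected)


def two_means(rows, rng, iters=20):
    """Tiny k-means with k=2 (an assumed stand-in clustering step)."""
    centers = rng.sample(rows, 2)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to the nearer of the two centers.
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(row, centers[k])))
            for row in rows
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for k in (0, 1):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fitness(data, classes, genes, n_resamples=50, frac=0.8, seed=0):
    """Mean agreement between cluster labels and class labels over subsamples.

    data: one row per sample, one column per gene; genes: column indices.
    """
    rng = random.Random(seed)
    n = len(data)
    m = max(2, int(frac * n))
    scores = []
    for _ in range(n_resamples):
        idx = rng.sample(range(n), m)  # subsample without replacement
        rows = [[data[i][g] for g in genes] for i in idx]
        labels = two_means(rows, rng)
        scores.append(adjusted_rand_index(labels, [classes[i] for i in idx]))
    return sum(scores) / len(scores)


# Synthetic example: gene 0 tracks the class label, genes 1 and 2 are noise.
rng = random.Random(1)
classes = [0] * 10 + [1] * 10
data = [[5.0 * c + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for c in classes]
informative = fitness(data, classes, genes=[0])
noise = fitness(data, classes, genes=[1, 2])
print(informative, noise)
```

On this synthetic data the subset containing the class-related gene should score higher than the noise-only subset. A selection procedure in this spirit would search over gene subsets for a high score; how the thesis performs that search is not stated in the abstract.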

Keywords/Search Tags:

Data, Feature selection, Gene, New, Microarray, Procedures, Clustering

Related items

1	The Research Of Gene Selection And Clustering Method In Gene Microarray Data Analysis
2	Comprehensive data analysis for biomarker pattern discovery using DNA/protein microarray
3	Data Analysis Of Expression With Gene Microarray And Investigation For Gene Regulatory Networks
4	Design And Implementation Of Gene Microarray Data Classification System
5	Research On Vectorized Representation Of Discriminative Capability Of Gene And Gene-based Clustering
6	Research On Feature Selection For Classification In Microarray Gene Expression Data
7	Research On Relevant Problems Of DNA Microarray Expression Data Analysis
8	Study On Selection For Feature Gene Subset In Microarray Expression Profiles Based On A SVM And GA Hybrid Algorithm
9	Gene Microarray Data Classification Based On Tolerance Rough Sets
10	Microarray Data Clustering Algorithm