Some methods for analyzing high-dimensional genomic data

Posted on:2010-06-08

Degree:Ph.D.

Type:Thesis

University:Stanford University

Candidate:Nowak, Gen

GTID:2448390002987653

Subject:Biology

Abstract/Summary:

This thesis focuses on two statistical methods motivated by problems arising in genomics. The two methods, Complementary Hierarchical Clustering and the Fused Lasso Latent Feature Model, are designed to analyze high-dimensional genomic data: gene expression microarray data and aCGH data, respectively.

A phenomenon frequently observed when clustering RNA samples based on their gene expression profiles is that the clustering pattern is dominated by a group of highly differentially expressed genes with similar patterns of expression. However, the function of these genes is often already known, and they can obscure the effects of other genes that have potentially novel functions. We propose a procedure called Complementary Hierarchical Clustering that is designed to uncover the structures arising from such other genes. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called Relative Gene Importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients that is uncorrelated with a number of known prognostic signatures and whose groups have significantly differing distant metastasis-free probabilities.

Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure models each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified by applying the fused lasso penalty to each feature. Simulation analyses show that FLLat outperforms a single-sample method when the simulated samples share common information. We also demonstrate two methods for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and STAC identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying three distinct groups of samples based on their patterns of CNV for chromosome 17.
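
The modeling idea behind FLLat, each sample written as a weighted sum of a few features with a fused lasso penalty on every feature, can be illustrated with a small sketch. This is only one plausible way to write such an objective, not the thesis's implementation: the names `Y`, `B`, `Theta`, `lam1`, `lam2` and both function names are assumptions made for illustration, and the optimization itself, along with any constraints on the weights, is left out. The penalty combines a sparsity term (sum of absolute values) with a smoothness term (sum of absolute differences between adjacent positions), which is the standard form of the fused lasso.

```python
def fused_lasso_penalty(feature, lam1, lam2):
    """Fused lasso penalty for one feature (a list of floats ordered by genomic position).

    lam1 weights the sparsity term sum |b_l|; lam2 weights the smoothness
    term sum |b_l - b_(l-1)| over adjacent positions.
    """
    sparsity = sum(abs(b) for b in feature)
    smoothness = sum(abs(b - a) for a, b in zip(feature, feature[1:]))
    return lam1 * sparsity + lam2 * smoothness


def fllat_style_objective(Y, B, Theta, lam1, lam2):
    """Squared error of Y ~ B * Theta plus a fused lasso penalty on each feature.

    Hypothetical layout (an assumption, not taken from the thesis):
      Y     : L x S nested list, positions by samples (observed data)
      B     : L x J nested list, positions by features
      Theta : J x S nested list, feature weights for each sample
    """
    n_pos, n_samples, n_features = len(Y), len(Y[0]), len(Theta)

    # Reconstruction error: each sample is a weighted sum of the features.
    error = 0.0
    for l in range(n_pos):
        for s in range(n_samples):
            fitted = sum(B[l][j] * Theta[j][s] for j in range(n_features))
            error += (Y[l][s] - fitted) ** 2

    # Penalize each feature (a column of B) separately.
    penalty = sum(
        fused_lasso_penalty([B[l][j] for l in range(n_pos)], lam1, lam2)
        for j in range(n_features)
    )
    return error + penalty
```

A piecewise-constant feature such as `[0, 0, 1, 1, 0]` has few nonzero entries and few jumps, so it incurs a small penalty; this is why minimizing an objective of this kind tends to produce features with contiguous nonzero segments, which can then be read as candidate regions of CNV.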

Keywords/Search Tags:

Data, Methods, CNV, Genomic, Analyzing, Samples

Related items

1	Statistical methods for analysis of graph-constrained genomic data
2	Kernel machine methods for analysis of genomic data from different sources
3	Machine Learning for High Throughput Genomic Data Analysis
4	Genomic data mining enhanced by symbolic manipulation of Boolean functions
5	Signal Processing Methods for Genomic Sequence Analysis
6	Statistical methods for analyzing multiple race response data
7	Formal methods for genomic data integration
8	Research On Imbalanced Data Classification Methods For Unsafe Samples
9	Genomic data analysis and processing with signal processing techniques
10	Cloud-scale Genomic Signal Processing for Robust Microarray Data Analysis