
Data compression and its statistical implications, with an application to the analysis of microarray images

Posted on: 2002-08-21
Degree: Ph.D.
Type: Thesis
University: University of California, Berkeley
Candidate: Jornsten, Rebecka Jenny
Full Text: PDF
GTID: 2468390014950329
Subject: Statistics
Abstract/Summary:
This thesis consists of three parts. Although each part is self-contained, a common theme runs through all of them: data compression and its implications for statistical inference. In particular, we consider three questions. How can we quantify the effect of compression on statistical inference? How should a compression scheme be designed so that the effect of compression on inference is minimal? How can the Minimum Description Length (MDL) principle be used for model selection with an extraordinary number of dependent predictors? We attempt to answer these questions in a general setting, and with a specific application to the compression and analysis of microarray images.

In the first part of the thesis, we present new results in the context of multiterminal data compression. We derive an improved upper bound on the asymptotic estimation efficiency under rate constraints. Furthermore, we give a geometric interpretation of the new bound, which provides insight into the nature of the multiterminal estimation problem. The bound on asymptotic estimation efficiency gives a gold standard by which practical compression schemes can be evaluated and the effect of compression on estimation analyzed.

In the second part, we present a progressive lossy and lossless compression scheme for microarray images. Microarray image technology makes possible the simultaneous measurement of the expression levels of thousands of genes. These images have become standard tools for investigating fundamental biological functions such as gene regulation and interaction, and for discovering genetic pathways for diseases such as cancer. They are widely used in academic and industrial laboratories, producing vast quantities of image data. Our compression scheme has been tailored to the microarray image application, so that the essential statistical information in the images is well preserved at low bit rates. The compression scheme has a multi-level coded data structure, which allows for fast re-processing and transmission of image subsets.

The information extracted from microarray image experiments provides statisticians with formidable data analysis tasks. In current research, particular attention has been given to the problems of gene clustering and sample classification. Each microarray experiment, or sample, corresponds to a type of tissue, tumor, or stage of development. In gene clustering, the goal is to identify genes that exhibit similar expression levels across samples or experiments. In sample classification, a collection of gene expressions is used to build a predictive model for the sample type.

In the third part of the thesis, we present a new MDL model selection criterion for the simultaneous clustering of genes and selection of subsets of gene clusters that function as sample class predictors. For the first time, an MDL selection criterion is given for both predictor variables (genes) and response variables (sample class labels). Using this criterion, we are able to build parsimonious classifiers that perform as well as or better than the best methods reported in the literature. The criterion is generally applicable to prediction problems with highly correlated predictors, where the number of predictors significantly exceeds the number of samples.
Keywords/Search Tags: Compression, Microarray image, Sample, Statistical, Model selection criterion, MDL, Gene, Application