Statistical methods for gene expression analysis from cDNA microarrays

Posted on: 2002-07-06
Degree: Ph.D.
Type: Dissertation
University: University of California, Berkeley
Candidate: Bryan, Jennifer Frazier
Full Text: PDF
GTID: 1460390011498138
Subject: Biology
Abstract/Summary:
Recent developments in microarray technology make it possible to capture the gene expression profiles for thousands of genes at once. With these data, researchers are tackling problems ranging from the identification of “cancer genes” to the formidable task of adding functional annotations to our rapidly growing gene databases. Specific research questions suggest patterns of gene expression that are interesting and informative: for instance, genes with large variance or groups of genes that are highly correlated. Cluster analysis and related techniques are proving to be very useful. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Given the high dimensionality (thousands of genes) and small sample sizes (often <30) encountered in these datasets, an honest assessment of sampling variability is crucial and can prevent the over-interpretation of spurious results. We describe a statistical framework that encompasses many of the analytical goals in gene expression analysis; our framework is completely compatible with many of the current approaches and, in fact, can increase their utility. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. In addition to subset membership, the target subset can include information about relationships between genes, such as clustering. This target subset is itself a parameter of interest, which we can estimate by applying the rule to the sample statistics of microarray data. The parametric bootstrap, based on a multivariate normal model, is used to estimate the distribution of these estimated subsets, and relevant summary measures of this sampling distribution are proposed. We focus on rules that operate on the mean and covariance.
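
The subset-estimation idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the rule `top_variance_rule` (keep the `k` genes with the largest variance), the function `bootstrap_subset`, and the per-gene reappearance frequency used as a summary measure are all hypothetical choices made for illustration. The sketch only assumes what the abstract states: a deterministic rule applied to the mean and covariance, and a parametric bootstrap from a multivariate normal model fitted to the data.

```python
import numpy as np


def top_variance_rule(mean, cov, k):
    """Hypothetical deterministic rule: indices of the k genes with largest variance.

    The mean is accepted but unused, so that every rule shares the same
    (mean, covariance) signature.
    """
    variances = np.diag(cov)
    return frozenset(np.argsort(variances)[-k:].tolist())


def bootstrap_subset(data, rule, n_boot=200, seed=0):
    """Estimate a gene subset and how often each gene reappears under resampling.

    data: array of shape (n_samples, n_genes).
    rule: function (mean, cov) -> set of gene indices.
    Returns the subset estimated from the data and, for every gene, the
    fraction of parametric bootstrap replicates whose subset contains it.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape

    # Plug-in estimate: apply the rule to the sample mean and covariance.
    mean_hat = data.mean(axis=0)
    cov_hat = np.cov(data, rowvar=False)
    estimate = rule(mean_hat, cov_hat)

    counts = np.zeros(p)
    for _ in range(n_boot):
        # Parametric bootstrap: draw a new data set of the same size from
        # the fitted multivariate normal, then re-apply the same rule.
        sample = rng.multivariate_normal(mean_hat, cov_hat, size=n)
        subset = rule(sample.mean(axis=0), np.cov(sample, rowvar=False))
        counts[list(subset)] += 1

    return estimate, counts / n_boot
```

A rule with a fixed `k` can be passed as `lambda m, c: top_variance_rule(m, c, 3)`. Genes whose reappearance frequency is close to 1 are stable members of the estimated subset, while low frequencies flag members that may be artifacts of sampling variability.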
Using Bernstein's Inequality, we obtain consistency of the subset estimates, under the assumption that the sample size grows to infinity faster than the logarithm of the number of genes. We also provide a conservative sample size formula guaranteeing that the sample mean and sample covariance matrix are uniformly within a distance ε > 0 of the population mean and covariance. The practical performance of the method using a cluster-based subset rule is illustrated with simulation studies and with an analysis of a publicly available leukemia data set. We describe extensions of the method to settings in which multiple populations are compared or gene expression is measured over time or at different values of a covariate.
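
To see why the sample size only needs to outpace the logarithm of the number of genes, a simplified calculation may help. The sketch below is not the dissertation's formula, which also covers the covariance matrix and is not reproduced in this abstract. It treats the sample means only and assumes, for illustration, that each gene's centered expression is bounded in absolute value by `M` and has variance at most `sigma2`. Under those assumptions, Bernstein's inequality combined with a union bound over `p` genes gives P(max_j |mean_j − μ_j| ≥ ε) ≤ 2p·exp(−nε² / (2(σ² + Mε/3))), and solving for n yields the function below. The name `bernstein_sample_size` and all of its parameters are hypothetical.

```python
import math


def bernstein_sample_size(p, eps, delta, sigma2, M):
    """Conservative n so that all p sample means are within eps of their
    population means with probability at least 1 - delta.

    Assumptions (illustrative, not taken from the dissertation):
      - observations are independent across samples,
      - each centered measurement is bounded by M in absolute value,
      - each gene's variance is at most sigma2.

    Solves 2 * p * exp(-n * eps**2 / (2 * (sigma2 + M * eps / 3))) <= delta for n.
    """
    n = 2.0 * (sigma2 + M * eps / 3.0) * math.log(2.0 * p / delta) / eps**2
    return math.ceil(n)
```

Because `p` enters only through `log(2p/δ)`, multiplying the number of genes by a thousand adds a fixed increment to the required sample size rather than multiplying it, which matches the growth condition stated above.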
Keywords/Search Tags: Gene expression, Statistical