Statistical methods for analyzing 'omics data with emphasis on point-mass mixtures

Posted on: 2010-01-24
Degree: Ph.D.
Type: Thesis
University: University of California, Davis
Candidate: Taylor, Sandra Lynn
Full Text: PDF
GTID: 2440390002477807
Subject: Biology
Abstract/Summary:
Increasingly, researchers are conducting high-throughput experiments involving transcriptomics, metabolomics, and proteomics data, collectively termed 'omics data. One particularly active area of research involves treating 'omics measurements as quantitative traits and studying associations between trait variation and discrete variables such as treatment groups or genotypes. 'Omics data can have highly variable distributions that do not conform to the assumptions of standard statistical methods. As a result, standard statistical methods can have reduced power to detect biologically meaningful signals. Here, I develop and evaluate methods for hypothesis testing and for mapping quantitative trait loci (QTL) when the data are not normally distributed, with particular emphasis on data distributed as a point-mass mixture.

For hypothesis testing, I propose a novel empirical likelihood ratio test (LRT) statistic for simultaneously testing the null hypothesis of no difference in point-mass proportions and no difference in means of the continuous component. I evaluate the performance of the empirical LRT and three existing point-mass mixture statistics: (1) a two-part statistic with a t-test for testing mean differences (Two-part t), (2) a two-part statistic with a Wilcoxon test for testing mean differences (Two-part W), and (3) a parametric LRT. In analyzing metabolomics data from Arabidopsis thaliana, I found that all four point-mass mixture statistics identify more significant differences than standard t-tests and Wilcoxon tests. Through simulations, I found the parametric LRT to be the most powerful test when the model assumptions were correct. However, the empirical LRT, which does not require parametric assumptions, provided an attractive alternative to parametric and standard methods when the data came from widely varying distributions.

To evaluate point-mass mixtures in the context of QTL mapping, I propose a novel two-part composite interval mapping (CIM) method. I compared the new method to existing normal and binary CIM methods through an analysis of metabolomics data from Arabidopsis thaliana and a simulation study. I found that my two-part CIM has greater power and a lower false positive rate than the other methods when a continuous phenotype is measured with many zero observations. The advantages of the two-part method were most apparent when the differences in the means and in the point-mass proportions were in opposite directions.

Finally, I extended an empirical likelihood method for estimating QTL effects to also estimate QTL locations, a primary objective of QTL mapping studies. For a single 20 cM interval, my approach yielded estimates of QTL locations similar to those from normal likelihood and non-parametric methods. However, when applied to multiple intervals with closer marker spacing (5 or 10 cM), the empirical likelihood approach provided essentially unbiased estimates of the QTL location but was less efficient than normal likelihood interval mapping. Also, because the solution for the empirical likelihood LOD score has its support on the observed data, evidence for a QTL could not be evaluated in some intervals due to an insufficient number of recombinants.

Together, these new methods offer a range of tools for the analysis of 'omics data as quantitative traits.
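
To make the "Two-part t" statistic named in the hypothesis-testing paragraph concrete, the following is a minimal sketch of the commonly described two-part test for two groups. It is an illustration under stated assumptions, not the implementation used in the thesis: the point mass is assumed to sit at zero, the proportions are compared with a pooled two-sample z statistic, the nonzero values are compared with an equal-variance t-test, and the sum of the two squared statistics is referred to a chi-square distribution. The function name `two_part_t` and its return format are hypothetical.

```python
import numpy as np
from scipy import stats


def two_part_t(x, y):
    """Sketch of a two-part test for two groups with a point mass at zero.

    Assumptions (not taken from the thesis): the point mass is at 0, and
    the statistic is B^2 + T^2, with B a pooled z statistic for the
    proportion of zeros and T an equal-variance t statistic on the
    nonzero values. Returns (statistic, degrees of freedom, p-value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size

    # Part 1: compare the proportions of zero observations.
    zeros1, zeros2 = np.sum(x == 0), np.sum(y == 0)
    p_pool = (zeros1 + zeros2) / (n1 + n2)
    df = 0
    b2 = 0.0
    if 0 < p_pool < 1:
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        b2 = ((zeros1 / n1 - zeros2 / n2) / se) ** 2
        df += 1

    # Part 2: compare the means of the continuous (nonzero) component.
    x_pos, y_pos = x[x != 0], y[y != 0]
    t2 = 0.0
    if x_pos.size >= 2 and y_pos.size >= 2:
        t2 = stats.ttest_ind(x_pos, y_pos, equal_var=True).statistic ** 2
        df += 1

    # Refer the combined statistic to a chi-square with one degree of
    # freedom per part that could actually be computed.
    stat = b2 + t2
    p_value = stats.chi2.sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


group_a = [0, 0, 0, 1.0, 2.0, 3.0]
group_b = [0, 4.0, 5.0, 6.0, 7.0, 8.0]
stat, df, p_value = two_part_t(group_a, group_b)
print(f"statistic={stat:.3f}, df={df}, p={p_value:.4f}")
```

Replacing the t-test in the second part with a Wilcoxon rank-sum statistic would give the analogous "Two-part W" form. Because the two parts are squared before summing, the combined statistic keeps its sensitivity when the proportion difference and the mean difference point in opposite directions.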
Keywords/Search Tags:Data, Methods, Point-mass, QTL, LRT, Empirical likelihood