
Novel Statistical Methods for Gene Set Enrichment Analysis with Empirical Memberships for Overlapping Gene

Posted on: 2019-01-20
Degree: Ph.D.
Type: Thesis
University: University of Rochester
Candidate: Zhang, Yun
Full Text: PDF
GTID: 2470390017988791
Subject: Statistics
Abstract/Summary:
Gene Set Enrichment Analysis (GSEA) is a powerful inferential tool that incorporates knowledge of a priori defined gene sets (e.g. molecular pathways) into high-throughput data analyses. Knowledge-based gene sets are available in bioinformatics resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. In databases built for general purposes, multifunctional genes are assigned to several pathways simultaneously. In study-specific analyses (e.g. of a specific disease), these genes, which overlap across multiple pathways, are counted multiple times regardless of whether their signals are associated with the disease. However, most existing methods ignore the effect of overlapping genes in GSEA. In this thesis, we reveal substantial overlap among KEGG pathways. We show that overlapping genes exhibit pathway-specific activation under study-specific conditions. Further, we computationally decompose the overlapping genes using study-specific data and develop appropriate similarity measures to assign their pathway memberships empirically. Unlike the traditional binary membership (i.e. either 0 or 1), the empirical membership is quantified using continuous weights. We design novel GSEA methods for two types of data: time-course data and data with limited time points (e.g. cross-sectional data). The former contain rich temporal information on individual subjects, which has the potential to lead to personalized inference for precision medicine diagnosis. The latter have a simpler structure and are available from the vast majority of studies. Using functional data analysis and high-dimensional statistical learning tools, we build a functional model and a cross-sectional model for these two data types, respectively. Upon obtaining the weights (a.k.a. empirical memberships), we also derive two generalized hypothesis tests (one parametric and one nonparametric) that accommodate both the weights and inter-gene correlation in the pathway-level test. In contrast to the classical tests, these generalized tests are not only more flexible but also greatly reduce the computational burden for various applications to high-throughput data. For each new method, we conduct simulation studies and demonstrate its use through real data analyses. Lastly, all the methods developed are implemented with efficient algorithms in R packages that are publicly available.
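To make the idea of continuous empirical memberships concrete, the following is a minimal illustrative sketch in Python. It is not the thesis's method: the abstract does not specify its similarity measures, so this sketch substitutes absolute Pearson correlation between an overlapping gene and the mean profile of each pathway's non-overlapping genes. All names (`pearson`, `empirical_memberships`, `weighted_set_score`) and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two sequences; returns 0.0 if either is constant or empty."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def empirical_memberships(expr, pathways):
    """Assign each gene a continuous weight in every pathway it is annotated to.

    Assumption (not from the thesis): similarity is |Pearson correlation| with
    the mean profile of the pathway's exclusive (non-overlapping) genes.
    """
    # Count how many pathways each gene belongs to.
    counts = {}
    for genes in pathways.values():
        for g in genes:
            counts[g] = counts.get(g, 0) + 1

    # Reference profile per pathway: mean over its exclusive genes.
    profiles = {}
    for name, genes in pathways.items():
        core = [expr[g] for g in genes if counts[g] == 1]
        profiles[name] = [sum(col) / len(col) for col in zip(*core)]

    weights = {name: {} for name in pathways}
    for g, c in counts.items():
        owners = [name for name, genes in pathways.items() if g in genes]
        if c == 1:
            weights[owners[0]][g] = 1.0  # exclusive genes keep binary membership
            continue
        sims = {name: abs(pearson(expr[g], profiles[name])) for name in owners}
        total = sum(sims.values())
        for name in owners:
            # Normalize so the gene's weights sum to 1; split evenly if uninformative.
            weights[name][g] = sims[name] / total if total > 0 else 1.0 / len(owners)
    return weights

def weighted_set_score(gene_stats, set_weights):
    """Weighted mean of gene-level statistics for one pathway."""
    total = sum(set_weights.values())
    return sum(w * gene_stats[g] for g, w in set_weights.items()) / total

# Toy data: "shared" is annotated to both pathways but tracks pathway A.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [5.0, 1.0, 4.0, 2.0],
    "g4": [3.0, 1.0, 4.0, 0.0],
    "shared": [1.1, 2.0, 2.9, 4.2],
}
pathways = {"A": {"g1", "g2", "shared"}, "B": {"g3", "g4", "shared"}}
weights = empirical_memberships(expr, pathways)
score_a = weighted_set_score({"g1": 2.0, "g2": 2.0, "shared": 5.0}, weights["A"])
```

In this toy setup the shared gene receives a larger weight in pathway A than in pathway B, so it contributes more to A's weighted score instead of being counted fully in both pathways. A pathway-level test that also accounts for inter-gene correlation, as the abstract describes, is beyond this sketch.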
Keywords/Search Tags:Data, Gene, Overlapping, GSEA, Empirical, Memberships, Methods