Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

Posted on: 2016-09-02
Degree: Ph.D
Type: Thesis
University: University of Pittsburgh
Candidate: Kim, SungHwan
Full Text: PDF
GTID: 2478390017980688
Subject: Biostatistics
Abstract/Summary:
Over the past decades, statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in important biomedical research tasks. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer otherwise intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks that tackle practical problems in multi-omics data integration analysis.

Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a simple, robust rank-based algorithm to identify rank-altered gene pairs between case and control classes. However, TSP usually shows greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is established in a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.

One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization, in order to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters.
We show with two real examples and simulated data that our proposed methods improve on existing integrative clustering in clustering accuracy and biological interpretation, and that they generate coherent tight clusters.

Principal component analysis (PCA) is commonly used to project data onto a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks of PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. Theoretically, Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that the Meta-PCAs accurately identify the true principal component space and remain robust to noise features and outlier samples. We also propose sparse Meta-PCAs that penalize principal components in order to selectively accommodate significant principal component projections. In several simulated and real data applications, we found Meta-PCA efficient at detecting significant transcriptomic features and at recognizing visual patterns in multi-omics data sets.

In the future, successful data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data sets, and will facilitate disease subtype discovery and characterization that improve hypothesis generation towards precision medicine and potentially advance public health research.
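As an illustration of the rank-based idea behind the top scoring pair rule mentioned in the abstract, the following is a minimal single-study sketch, not the MetaTSP method of the dissertation. It assumes the standard TSP score, the absolute difference between classes in the fraction of samples where gene i is expressed below gene j. The function names `top_scoring_pair` and `predict_tsp` and the 0/1 label coding are hypothetical choices made for this example.

```python
from itertools import combinations


def top_scoring_pair(X, y):
    """Find the gene pair whose relative ordering differs most between classes.

    X: list of samples, each a list of expression values (one per gene).
    y: list of labels, 1 for case and 0 for control (assumed coding).
    Returns (i, j, score, case_if_less), where case_if_less is True when
    "gene i < gene j" is more frequent among cases than among controls.
    """
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    if not cases or not controls:
        raise ValueError("both classes must contain at least one sample")

    def frac_less(rows, i, j):
        # Fraction of samples in which gene i is expressed below gene j.
        return sum(row[i] < row[j] for row in rows) / len(rows)

    best = None
    for i, j in combinations(range(len(X[0])), 2):
        p_case = frac_less(cases, i, j)
        p_ctrl = frac_less(controls, i, j)
        score = abs(p_case - p_ctrl)
        # Keep the first pair that attains the highest score.
        if best is None or score > best[2]:
            best = (i, j, score, p_case > p_ctrl)
    return best


def predict_tsp(X, i, j, case_if_less):
    """Classify each sample from the ordering of genes i and j alone."""
    return [1 if (row[i] < row[j]) == case_if_less else 0 for row in X]
```

Because the rule depends only on the within-sample ordering of two genes, it is unaffected by monotone rescaling of a sample's expression values, which is one reason rank-based rules are described as robust.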
Keywords/Search Tags: Data, Supervised, Statistical, Principal component, Methods