Integrative Unsupervised Learning Based On Multi-Source Data

Posted on: 2020-04-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Fan
GTID: 1480305741464884
Subject: Statistics
Abstract/Summary:
With the advancement of science and technology, major changes have taken place in the ways data are generated and stored, and data sources are becoming increasingly diverse. On the one hand, data from different subjects or formats are merged into large datasets with abundant information. On the other hand, multidimensional profiling of the same samples is now measured in many research domains. Because of this complexity, large amounts of data are unlabeled and unclassified, and labeling massive amounts of data is extremely difficult and time-consuming. Therefore, how to use multidimensional data effectively and integrate multiple datasets for unsupervised learning has attracted widespread attention in statistical research.

Among the existing methods, principal component analysis (PCA) and graphical models are of great importance in unsupervised learning. PCA, a popular dimension reduction technique, plays an important role in high-dimensional data analysis. Graphical (network) models are fundamental to the study of conditional dependence relationships among a set of random variables. Despite promising successes, PCA and graphical models fitted to a single high-dimensional dataset often generate unsatisfactory results with low reproducibility because of the high dimensionality and small sample size. A series of studies in supervised learning has shown that integrative analysis, which provides an effective way of pooling information from multiple independent datasets and from multidimensional profiling, outperforms single-dataset analysis in high-dimensional settings. Considering the wide application of unsupervised learning methods and the good performance of integrative analysis, we conduct the following unsupervised integrative analyses, taking into account the multi-source and unlabeled nature of the data.

(1) With multiple independent datasets, we propose conducting dimension reduction using a novel integrative sparse PCA (iSPCA) approach. The sparse principal component analysis (SPCA) technique was developed to remove noise effectively and generate more interpretable results. However, because of the high dimensionality and small sample size, the results of SPCA on a single dataset are not satisfactory. We integrate multiple independent datasets for sparse principal component analysis to encourage information sharing among datasets. Penalization is adopted for regularized estimation and selection of important loadings. Advancing from existing integrative analysis studies, we further impose contrasted penalties, which may generate more accurate estimation and selection. Multiple settings on the similarity across datasets are comprehensively considered. Consistency properties of the proposed approach are established, and effective computational algorithms are developed. A wide spectrum of simulations demonstrates the competitive performance of iSPCA over the alternatives. Two data analyses further establish its practical applicability.

(2) We consider an approximate single factor integrative graphical model (SFIG) to pool and jointly analyze multiple datasets. It has long been acknowledged that the existence of common factors dramatically increases the density of associations. The single factor graphical model has been proposed to tackle this problem using a two-step strategy: extract the common factor first, and then conduct graphical modeling. In the approximate single factor graphical model, the increased number of parameters makes the "lack of information" problem more severe. To improve its performance, we integrate multiple datasets and conduct the approximate single factor graphical model analysis jointly. Penalization is adopted for regularized estimation and identification of important loadings and edges. Efficient computational algorithms are proposed. A wide spectrum of simulations and the analysis of three breast cancer gene expression datasets demonstrate the competitive performance of the proposed method.

(3) We develop a multidimensional integrative graphical model of core variables (MIGM). With the development of data acquisition technology, the collection of multidimensional data has become possible and increasingly convenient. In addition to the core variables (the variables of interest), other auxiliary information can be obtained for the same sample. Taking gene expression data as an example, we propose the MIGM, which can effectively use information in regulators to improve the estimation of the gene expression graphical structure. We propose an alternative estimate of the covariance matrix, based on the regulation relationship, to conduct graphical modeling. The proposed approach has an intuitive formulation and can adaptively accommodate different regulator scenarios. The consistency properties are rigorously established. Extensive simulations and the analysis of a breast cancer gene expression dataset demonstrate the practical effectiveness of the MIGM.

(4) We propose a multidimensional integrative graphical model based on the conditional score matching estimator (iSME). In some cases, we care about both the network structure of the core variables given the auxiliary variables and the direct impact of the auxiliary variables on the core variables. Existing research still has limitations, either in computational complexity or in the ability to estimate the direct effect of auxiliary variables on core variables. To overcome these limitations, a regularized conditional score matching loss function is used to estimate the network structure of the core variables and the direct influence of the auxiliary variables on the core variables. The proposed approach is intuitive and easy to compute. Consistency properties of the proposed approach are also established, and effective computational algorithms are developed. A wide spectrum of simulations demonstrates the competitive performance of iSME over the alternatives. Data analysis further establishes its practical applicability.
Keywords/Search Tags:Integrative analysis, Sparse PCA, Graphical model, Approximate single factor model, Contrasted penalization, Score matching loss
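
As a rough illustration of the single-dataset sparse PCA that item (1) of the abstract starts from, the following is a minimal sketch of a first sparse loading vector computed by soft-thresholded power iteration. It is not the dissertation's iSPCA: the contrasted penalties and the sharing of information across datasets are not reproduced here. The function names (`soft_threshold`, `sparse_pc1`), the penalty value `lam`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam (lasso-type thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pc1(X, lam, n_iter=200, tol=1e-8):
    """First sparse loading vector via soft-thresholded power iteration.

    Hypothetical helper for illustration only; lam is the sparsity penalty.
    """
    Xc = X - X.mean(axis=0)              # center each variable
    S = Xc.T @ Xc / Xc.shape[0]          # sample covariance matrix
    w = np.linalg.eigh(S)[1][:, -1]      # start from the ordinary first PC
    for _ in range(n_iter):
        w_new = soft_threshold(S @ w, lam)
        norm = np.linalg.norm(w_new)
        if norm == 0.0:                  # penalty too large: all loadings removed
            return np.zeros_like(w)
        w_new = w_new / norm
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w

# Simulated data (an assumption): 50 samples, 20 variables,
# with signal carried only by the first three variables.
rng = np.random.default_rng(0)
n, p = 50, 20
v = np.zeros(p)
v[:3] = 1.0 / np.sqrt(3.0)
z = rng.normal(size=n)
X = 3.0 * np.outer(z, v) + 0.3 * rng.normal(size=(n, p))

w = sparse_pc1(X, lam=1.0)
print(np.flatnonzero(w))                 # indices of the selected loadings
```

With a small sample and many variables, the unpenalized loadings on noise variables are small but nonzero; the thresholding step sets them exactly to zero, which is the "selection of important loadings" the abstract refers to. An integrative method would additionally couple the estimates from several datasets, which this sketch does not attempt.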