Research And Application Of Statistical Methods Of Multi-omics Prediction Model Based On Survival Outcome

Posted on: 2020-09-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S P Shen
GTID: 1364330596483761
Subject: Epidemiology and Health Statistics
Abstract/Summary:
In the era of big data, advances in molecular biology have moved research on complex diseases into a multi-omics era. Omics data are a form of biomedical big data characterized by ultra-high dimensionality and small sample size (p > n), which poses new challenges for traditional statistical methods. Complex diseases are generally thought to arise from the interaction of external environmental factors and intrinsic genetic characteristics. Following the central dogma, these genetic characteristics span several omics layers, such as genetic variation, DNA methylation, gene expression, miRNA expression, and protein. A comprehensive understanding of the different omics layers is critical to uncovering the development and mechanisms of disease. In clinical research, omics features are closely related to a patient's disease progression and outcome, so they are often used as biomarkers to predict patient outcome. However, prediction from a single variable (locus, gene, etc.) is often less effective than integrating multiple variables or multiple omics. This study focuses on the survival outcomes of cancer patients and examines methods for constructing unsupervised and supervised predictive models from common omics data. Each part consists of a simulation study and a real data analysis.

Part Ⅰ develops an algorithm for unsupervised, classification-based omics prognostic prediction and compares it with existing methods. We propose random partition fusion based on K-means (RPFKM), an algorithm for unsupervised multi-omics integrated prediction, and compare it with commonly used algorithms [K-means, hierarchical clustering, Gaussian mixture model classification based on the EM algorithm, iCluster, and similarity network fusion (SNF)]. We performed a multi-omics simulation covering three common types of omics data: multivariate normal data, beta-distributed data on the interval from 0 to 1, and three-category qualitative data. In the simulation, when normalized mutual information and the adjusted Rand index were used to evaluate classification, overall classification performance increased with the proportion of positive variables, the effect of the positive variables, and the difference between groups. RPFKM classified better than the other methods, especially when the proportion of positive variables was low. When the C-index was used to evaluate outcome prediction, RPFKM had higher predictive power when the proportion of positive variables was low, and the difference from the other methods was not significant when that proportion was high. In the real data analysis, we used The Cancer Genome Atlas (TCGA) pan-cancer data to comprehensively evaluate the predictive performance of RPFKM in each tumor type. We included tumor-tissue gene expression and DNA methylation for the immune genes profiled in the ImmPort database, as well as miRNAs with extensive regulatory relationships. The classification significantly distinguished patient prognosis in most tumors, with a high C-index (mean 0.668, standard deviation 0.084 across all tumors). Classification was not significant in some tumors, possibly because of interactions with environmental factors beyond molecular biology.

Part Ⅱ is a methodological evaluation of supervised omics prognostic prediction models. We systematically evaluated six existing supervised prognostic prediction methods: univariate screening, penalized regression methods [least absolute shrinkage and selection operator (LASSO), elastic net (ENET), sure independence screening (SIS)], and machine learning methods (random forest, CoxBoost). In the simulation, we generated multivariate normal data under three scenarios defined by the covariance structure: independent variables, dependent variables, and a covariance structure taken from actual data. LASSO, ENET, and CoxBoost showed better prediction performance as evaluated by the C-index, R², true positive rate (TPR), and false negative rate (FNR). However, all variable screening methods had a false discovery rate (FDR) too high to be ignored. In the real data analysis, we analyzed three types of data from oral squamous cell carcinoma (OSCC): DNA methylation, gene expression, and clinical characteristics. To identify methylation features in OSCC, we used a multi-stage screening strategy. First, we established a prognostic model for DNA methylation, using the TCGA OSCC cohort as the training set and two independent data sets from the GEO database as validation sets. Second, we explored the relationship between methylation and gene expression, and the association between expression and overall survival. Finally, mediation analysis was used to investigate the causal relationship among DNA methylation, gene expression, and patient outcome. Seven CpG loci were ultimately selected to establish a prognostic model, which significantly predicted patient survival [training set: hazard ratio (HR) = 3.23, P = 5.52×10⁻¹⁰; validation set 1: HR = 2.79, P = 0.010; validation set 2: HR = 3.69, P = 0.011]. The ROC curve shows that the model has some predictive ability. Expression of the genes corresponding to the seven CpG sites (AJAP1, SHANK2, FOXA2, MT1A, ZNF570, HOXC4, and HOXB4) was also significantly associated with OSCC overall survival. Mediation analysis indicated that the effect of methylation on prognosis was significantly mediated by gene expression. Integrating DNA methylation, gene expression, and clinical data provided the best prognostic predictive power (AUC = 0.78). The identification of DNA methylation and gene expression biomarkers might help improve the early diagnosis and survival prediction of OSCC and add evidence for clinical adjuvant therapy, providing a basis for precision medicine.
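Both parts of the study report the C-index as the measure of outcome prediction. As an illustration only, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. The function name `harrell_c_index` and the toy data are assumptions made for this example; the abstract does not say which C-index estimator or software the thesis used, and this sketch simply skips pairs with tied observed times.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the subject with the
    higher risk score has the shorter observed survival time.

    times       -- observed follow-up times
    events      -- 1/True if the event was observed, 0/False if censored
    risk_scores -- higher value means higher predicted risk (a cluster
                   label ordered by risk can also be used; ties count 0.5)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Simplification: a pair is comparable only when the times differ
        # and the shorter time is an observed event, not a censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(harrell_c_index(times, events, risk))  # 0.8
```

In the toy data, five pairs are comparable and four of them are ordered correctly, giving 0.8. A value of 0.5 corresponds to predictions no better than chance, and 1.0 to a perfect ranking.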
Keywords/Search Tags:multi-omics, predictive model, survival outcome, unsupervised, supervised, cancer