| Objective: Joint analysis based on gene sets and multiple traits can identify biological pathways associated with complex diseases and help explore the biological functions and processes behind those pathways. Gene-set-based analysis improves the ability to identify genetic effects related to complex diseases and reduces the loss of heritability. Multi-trait analysis is often used to locate genetic effects related to complex traits and to reveal the genetic patterns and mechanisms of those traits, and it has higher statistical power and accuracy than single-trait analysis. The purpose of this study is to develop a high-dimensional statistical inference framework that detects the association between gene sets and multi-trait responses through a composite hypothesis test under the predictor envelope model (X-envelope), exploiting the model's good parameter estimation performance and high statistical power, in order to enrich the methods available for multi-trait, multi-locus genetic association analysis.

Methods: We constructed a three-stage high-dimensional statistical inference framework based on the predictor envelope model, as follows:

1. DC-SIS to screen variables. We used the distance correlation based sure independence screening (DC-SIS) procedure to screen high-dimensional gene expression data and obtain candidate genes. The reason is that X-envelope cannot be fitted with an ill-conditioned matrix [1]: it requires the covariance matrix of the predictors to be nonsingular, that is, the sample size must be greater than the number of predictors (n > p). X-envelope therefore cannot process high-dimensional genomic data directly, and DC-SIS compensates for this shortcoming.

2. Composite hypothesis. Based on the X-envelope model and the traditional multivariate multiple linear regression (MMLR) model, regression analysis is performed on the candidate genes and multiple traits, and composite hypothesis tests are performed on the group of regression coefficients composed of gene sets and
multiple traits, so that X-envelope can detect the effects of complex-disease gene sets on multiple traits. The results are compared with those of MMLR.

3. False discovery rate (FDR) corrected P value. The P values of the composite hypothesis tests are corrected based on the false discovery rate (FDR) to control the false positive rate across the multiple tests.

4. Simulation studies. Simulations were used to evaluate the statistical inference performance of our multi-stage high-dimensional statistical inference framework. In particular, previous research on the X-envelope model has mainly focused on the effectiveness of parameter estimation and the predictive ability of the model, while little research has addressed its type I error and power. Through a large number of simulation experiments, this study estimated the type I error and power of the multi-stage statistical inference framework based on the X-envelope model (see the Methods section for details). We also compared these results with those of PLSR combined with Fisher's P value integration method to evaluate the statistical inference ability of the X-envelope model.

5. Real data analysis. We used a real data set to validate the three-stage high-dimensional statistical inference framework based on the predictor envelope model. The data come from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The cases are 141 Alzheimer's disease (AD) patients aged 55 to 90 years, collected from 2004 to 2017. The criteria for determining AD patients were developed by ADNI database staff based on reports from the National Institute on Aging (NIA) and Alzheimer's Disease International (ADI). For each gene, this study took the maximum of the multiple probe measurements corresponding to that gene in AD patients as its expression level (mRNA), and structural magnetic resonance images (sMRI) of five brain regions (hippocampus, ventricle, entorhinal cortex, fusiform gyrus and middle temporal gyrus)
of AD patients were quantified as brain volumes, which served as the multiple traits. The resulting AD genomic data set has sample size n = 141, with the expression of 20,068 genes as predictors and 5 brain volumes as the multiple traits. We then applied the three-stage high-dimensional statistical inference framework based on the predictor envelope model to these data to obtain the significant pathways, according to the false discovery rate (FDR), under the X-envelope model and the traditional MMLR model, and so to evaluate the ability of X-envelope to detect gene-set effects on complex diseases based on multiple traits. The enrichment results of the traditional gene pathway enrichment analysis tool DAVID were also compared with the results of these two models to further verify the detection performance of X-envelope for gene sets in complex diseases. We then searched the relevant literature to verify the real data analysis results of the X-envelope model. Based on the enrichment analysis results of X-envelope, the gene pathways related to brain atrophy in AD were screened, providing evidence for the early detection and diagnosis of AD. Finally, we conducted a pathway enrichment analysis using PLSR combined with Fisher's P value integration method, as well as a single-trait X-envelope pathway enrichment analysis on the AD data, to compare with and evaluate the three-stage high-dimensional statistical inference framework based on the multi-trait X-envelope model proposed in this study.

Results:

1. The simulation results show that the type I
error of the predictor envelope model was 0.055, 0.053 and 0.046 at sample sizes of 150, 200 and 250, and the power was 0.856, 0.89 and 0.924, respectively. The type I error of the X-envelope composite hypothesis test is thus well controlled across sample sizes and remains close to the significance level of 0.05. In addition, the power of the X-envelope model increases with the sample size, which is consistent with our expectations. In contrast, the traditional MMLR model cannot handle the correlation between responses or the collinearity of predictors (see the Methods section for details): model fitting terminated because the singular matrix could not be handled, and no composite hypothesis test results were produced. The power of PLSR combined with Fisher's P value integration method was extremely low (all less than 0.20), and its type I error was not controlled (all greater than 0.10).

2. A total of 141 patients with AD were included; the expression data of 20,068 genes were used as predictors and the brain volume data of 5 brain regions as responses. DC-SIS variable screening yielded 84 candidate genes, which were fitted with the predictor envelope model and the traditional MMLR model and then enriched to gene pathways. The envelope dimension selected by AIC was 23. The predictor envelope model identified 76 significant pathways (FDR-corrected P ≤ 0.05): 72 GO pathways, comprising 43 biological processes (BP), 19 cellular components (CC) and 10 molecular functions (MF), and 4 KEGG pathways. Moreover, we identified one of the most important pathways for AD patients, the Alzheimer's disease pathway, and mapped its mechanism. For example, the APP protein encoded by the gene APP in this pathway can be cleaved by the γ-secretase complex to generate AICD (APP intracellular C-terminal
domain), and AICD is neurotoxic and can induce neuronal apoptosis in the presence of mitochondrial DNA. AICD can also induce neuronal apoptosis through regulation of gene expression. Other significant pathways include those closely related to the production and clearance of β-amyloid fibril protein (Aβ) and β-amyloid precursor protein (APP) in the human brain, and lipoprotein-related pathways linked to APOE gene variation: lipoprotein particles are involved in lipid transport in the human brain, and their abnormalities can affect neuronal function. Based on FDR-corrected P values, the traditional MMLR model yielded no significant gene pathways. The DAVID database yielded only one significant pathway, positive regulation of amyloid fiber formation. The statistical analysis framework based on the single-trait X-envelope model selected a larger set of significant pathways than the multi-trait X-envelope model, and its hypothesis-test P values were much larger than those of the multi-trait analysis. PLSR combined with Fisher's P value integration method did not identify any significant pathways.

Conclusions:

1. The three-stage high-dimensional statistical inference framework based on the X-envelope model that we developed has good statistical inference performance and can detect the effects of complex-disease gene sets on multiple traits.

2. When X-envelope models multiple predictors and multivariate responses, it both handles the collinearity of the predictors and accounts for the correlation between responses, whereas the traditional MMLR model can do neither. Therefore, the composite hypothesis testing ability of the X-envelope model is superior to that of the traditional MMLR model. |
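
The DC-SIS screening stage described in the Methods can be illustrated with a minimal sketch. It assumes the standard sample (V-statistic) distance correlation and a user-chosen number of retained genes; the function names and the `n_keep` parameter are hypothetical and are not taken from the study, which does not state its threshold here.

```python
import numpy as np

def _centered_dist(a):
    """Double-centered Euclidean distance matrix of the rows of `a`."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # treat a vector as n observations of one variable
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between x (n,) or (n, p) and y (n,) or (n, q)."""
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = (A * B).mean()                        # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0                                # a constant variable carries no signal
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def dc_sis(X, Y, n_keep):
    """Rank the columns of X by distance correlation with the (multivariate)
    response Y and return the indices of the n_keep highest-ranked columns.
    n_keep is an illustrative tuning parameter (assumption)."""
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], Y)
                       for j in range(X.shape[1])])
    return np.argsort(-scores)[:n_keep]
```

Here `X` would be the n × p gene expression matrix and `Y` the n × 5 matrix of brain volumes; the returned column indices are the candidate genes passed to the envelope fit. Because distance correlation is defined for vector-valued responses, all traits are screened jointly rather than one at a time.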
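
The FDR correction stage can be sketched in the same spirit. The abstract does not name the correction procedure, so the example below assumes the Benjamini-Hochberg step-up adjustment, a common choice; `bh_adjust` is a hypothetical helper name.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values (assumed procedure, see lead-in).

    Returns the adjusted values in the same order as the input.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ascending P values
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity, working down from the largest P value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted
```

Under this assumption, a pathway would be reported as significant when its adjusted P value from the composite hypothesis test is at most 0.05, matching the threshold quoted in the Results.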