With the development of computing and the advent of the data era, large amounts of ultrahigh-dimensional data are being generated. The evaluation and selection of such data depend heavily on model specification. In particular, parametric models can lead to biased estimation and variable selection when they are misspecified. On the other hand, nonparametric models, such as deep learning, can produce uninterpretable and unstable estimates. Semiparametric modeling is therefore a sensible compromise. The multi-index model, inspired by dimension reduction, is a semiparametric model that offers both interpretability and predictive accuracy. In this paper, we investigate how to make full use of data and structural information in complex high-dimensional data, and we develop regression models that combine interpretability, computational feasibility, and high prediction accuracy within the framework of multi-index models. We first propose a structured multiple-index model (SMIM) for ultrahigh-dimensional data analysis based on the structural characteristics of the data. In the low-dimensional case, the proposed model includes many commonly used semiparametric models as special cases, such as stochastic frontier models, single-index models, and additive-index models. The model has the following advantages: 1. it can be flexibly applied to a wide range of real data owing to its shallow deep-learning structure; 2. its index structure can be used to identify important risk factors related to the outcome variables, together with their degree of influence, with good interpretability; 3. it can effectively exploit the structural features behind the data to improve estimation efficiency. However, the specific data structure may introduce nonlinear characteristics, which makes estimation and theoretical derivation difficult for SMIM, a model with both multivariate nonparametric and high-dimensional attributes. We estimate all functions and parameters based on a full likelihood-type function. As
a result, the proposed estimators are shown to be semiparametrically efficient, consistent in both selection and estimation, and asymptotically normal. Computation is challenging because of the nonconvexity of the likelihood function, the nonsmoothness of the penalty term, and the large number of functions. To solve the computational problem, we develop a technique that blends spline and kernel smoothing with a majorized coordinate descent algorithm, so that the procedure can be implemented easily with existing packages. Intensive simulation studies show that the proposed estimation procedure outperforms its alternatives in a variety of cases. Finally, we apply SMIM and the proposed estimation procedure to a real dataset from one of China’s largest liquor companies and find that 31 of 2051 factors, including price, previous sales, per capita GDP, and residents, are important for the mean, stochastic frontier, inefficiency, and variance of liquor sales. The results can help enterprises reduce costs and improve profits, and the approach can be applied to similar analyses in other fields.

A second typical example of high-dimensional data is high-dimensional metabolite data. Studying how treatment variables, such as nutritional intake, affect adolescent growth and development through metabolites (mediators) is an effective way to understand metabolites and is one of the scientific questions in exploring the process of human growth and development. Causal pathways provide a powerful structural tool for analyzing high-dimensional mediators. Existing pathway analyses of high-dimensional mediators fall mainly into two groups: univariate analyses that ignore correlations, and multidimensional mediator analyses based on principal component dimension reduction. The former retain interpretability but predict poorly because they ignore correlations; the latter retain some predictive power but lack interpretability. In this paper, we
propose a multi-level high-dimensional semiparametric structural equation model (mSSEM) that is both interpretable and predictive. The model achieves high predictive accuracy through supervised grouping and variable selection of high-dimensional mediators via double dimension reduction and index modeling, which preserves interpretability while keeping the model flexible and adaptable to data. We propose a penalized least squares approach to estimate the parameters and unknown functions and to automatically identify path patterns of the high-dimensional mediators. A simple implementation procedure is developed with the help of existing well-developed packages. We establish key theoretical results for the proposed estimators of both the parameters and the unknown functions. Extensive numerical simulations show that the proposed mSSEM significantly outperforms existing methods in interpretation, estimation, and prediction. We analyze a metabolomics study using mSSEM and find that fat intake has a negative mediation effect on insulin resistance through metabolites. In addition, boys have lower insulin resistance than girls when other factors are held equal, i.e., girls are more likely than boys to develop diabetes. More importantly, the proposed method allows us to explore unknown metabolite mechanisms: through the supervised clustering results, the mechanisms of familiar metabolites can be used to explain unfamiliar ones. For example, the metabolite Estradiol Valerate (EV) is clustered in the same group as the familiar sphingomyelins (SMs) and ceramide, indicating that the mechanism of EV is the same as that of the latter two. This would make separate experiments on EV unnecessary, reducing costs to a large extent and providing important guidance for further related studies.