With the rapid development of modern information technology, the means and ability to collect data have improved significantly, and the emergence of massive data sets has opened the era of big data. Within these massive data sets, one often encounters data of very high dimensionality, known in statistics as ultra-high-dimensional data. In the analysis of ultra-high-dimensional data, only a few covariates have an important effect on the response variable. This property, known as sparsity, poses great challenges for traditional statistical inference and numerical computation. Using all of the available data to build a regression model incurs substantial time and economic costs, and can lead to unreasonable or even wrong conclusions. How to screen the important features of ultra-high-dimensional data without losing useful information has therefore become a major concern in both academia and industry.

Ultra-high-dimensional data often contain missing values, for various reasons: respondents may be reluctant to answer sensitive questions because of privacy concerns, respondents in a follow-up study may die, or they may be away on business trips at the time of follow-up. When the data for some variables are missing, simply deleting the incomplete respondents and basing statistical inference only on the completely observed respondents may lead to unreasonable or even wrong conclusions. Screening important covariates in ultra-high-dimensional data with missing values is therefore a hot and challenging topic in modern statistical research. Almost all existing variable screening methods focus on ultra-high-dimensional data in which the response variable or the covariates are continuous and fully observed, and they rarely address the case where the response variable and the covariates are categorical and subject to missingness. This dissertation studies variable screening when the response variable or the covariates are categorical and contain missing data. The research has both theoretical significance and important practical value.

The main research contents of this dissertation are as follows:

1. The conditional mean variance screening method of Cui et al. (2015), developed for a categorical response variable, is extended to variable screening for ultra-high-dimensional data whose covariates are categorical. Under the assumption that the data are sparse, a screening method based on empirical conditional distribution functions is used to screen the categorical covariates. The method measures the association between a covariate and the response variable through the difference between the distribution of the response variable Y and the conditional distribution of Y given the covariate X. Simulation studies show that, whether the relationship between X and Y is linear or more complex and nonlinear, the method effectively selects, from among the nominal ultra-high-dimensional covariates, the explanatory variables that contribute importantly to the response variable.

2. Variable screening for ultra-high-dimensional data is discussed for the case where covariates are missing at random, and a two-step screening method is proposed to screen both the covariates in the missing data mechanism model and the covariates that have an important effect on the response variable. In addition to assuming sparsity, this dissertation assumes that the missingness of a covariate depends only on the fully observed response variable and on a subset of the fully observed covariates. The first step uses the PC-SIS method to screen out the covariates related to the missing indicator, which effectively reduces the dimensionality of the explanatory variables. The second step uses the variable set obtained in the first step to estimate the joint probability of the covariates with missing observations and the response variable, defines the corresponding screening statistic for ultra-high-dimensional variables, and selects the important variables from the set of nominal covariates with values missing at random.

3. Extensive simulation studies verify the effectiveness of the proposed methods. The application of the proposed methods is then illustrated with data from a tourism quality survey. The numerical results show that the model-free method is stable, performs well, and is robust to heavy-tailed distributions of the response variable and to the presence of potential outliers. Even when a certain proportion of the data is missing, the method can still effectively extract the relevant information from ultra-high-dimensional data.
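
To make the first screening idea concrete, the following is a minimal Python sketch and not the dissertation's implementation. It assumes that the index for a categorical covariate X with levels k is the weighted squared distance between the conditional and marginal empirical distribution functions of Y, namely sum_k p_k * mean_j [F_k(y_j) - F(y_j)]^2, in the spirit of the mean variance index of Cui et al. (2015). The function names `mv_index` and `screen` and the cutoff `d` are hypothetical, and missing data are not handled.

```python
import numpy as np


def mv_index(x, y):
    """Assumed index: sum_k p_k * mean_j (F_k(y_j) - F(y_j))**2.

    x: 1-D array of category labels (the covariate).
    y: 1-D array of numeric responses, fully observed.
    """
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    # le[i, j] is True when y_i <= y_j, so column means give ECDF values at y_j.
    le = y[:, None] <= y[None, :]
    f_all = le.mean(axis=0)  # marginal ECDF of Y at each observed y_j
    index = 0.0
    for level in np.unique(x):
        mask = x == level
        p_k = mask.mean()  # empirical P(X = level)
        f_k = le[mask].mean(axis=0)  # conditional ECDF of Y given X = level
        index += p_k * np.mean((f_k - f_all) ** 2)
    return index


def screen(X, y, d):
    """Rank the columns of X by the index and keep the d largest (d is hypothetical)."""
    scores = np.array([mv_index(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order[:d], scores
```

A covariate that is independent of Y yields an index close to zero, whereas a covariate whose levels shift the distribution of Y yields a larger value, so ranking by the index and keeping the top `d` covariates mimics the screening step. Because the index depends on Y only through its ranks, it is insensitive to heavy tails and outliers in the response.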