Robust Variable Selection And Feature Screening Methodology And Application

Posted on:2023-02-20

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Pan

GTID:1520306632954689

Subject:Statistics

Abstract/Summary:

In the era of big data, data sources are becoming more abundant. From economics and finance to environmental ecology, medicine, and health, industries are increasingly adopting specialized technical methods to collect and organize data. Advances in computer technology allow large quantities of data to be stored and used effectively, and a variety of digital technologies now convert text, images, and even sound into digital information. These methods have made statistical analysis more convenient, but they have also produced high-dimensional and ultra-high dimensional data in these fields, posing unprecedented challenges to traditional statistical analysis. For example, in medicine and health, analyzing the causative factors of a major disease such as cancer, or the genetic mechanism of a hereditary disease, often requires gene detection technology to obtain gene expression data, which are then used to identify important disease-related genes from diseased samples. In practice, such studies often involve only dozens of subjects (patients), whereas the gene pool is huge and may contain thousands of genes or more, which makes this a typical ultra-high dimensional problem. Since the outbreak of COVID-19, many scholars have forecast the course of the epidemic, for example by using the Baidu search index, which allows more feature variables to be brought into the study. The Baidu search index is also widely used for other problems such as GDP and CPI forecasting, which shows that high-dimensional problems have gradually penetrated all areas of society. Because the covariance structure of high-dimensional data is more complicated and heavy tails are more pronounced, and especially because the number of features under
ultra-high dimensional data is much larger than the sample size, singular matrices often arise during estimation, causing the algorithm to fail to converge or even to have no solution. In addition, in high-dimensional settings the problem of spurious correlation cannot be ignored, as it usually makes the results of correlation analysis unreliable. How to solve the problems posed by high-dimensional data is therefore a key challenge for statistical analysis, and the theory of variable selection and feature selection for high-dimensional and ultra-high dimensional data has become one of the hot topics among modern statisticians. This dissertation explores and applies that theory from the following aspects.

First, this dissertation studies variable selection based on nonparametric and semiparametric models. When the dimensionality is high, the relationship between the covariates and the response may not be accurately described by ordinary parametric models, whereas nonparametric and semiparametric models are more flexible and are therefore increasingly used across statistical analysis. As one kind of nonparametric model, the varying index coefficient model (VICM) allows the coefficients of the index variables to vary with the independent variables, thereby revealing more dynamic relationships between variables. Moreover, a slight adjustment to the parameter structure of VICM transforms it into other nonparametric or semiparametric models, which shows how general VICM is. This dissertation therefore introduces the VICM and explores the PSLE method of model estimation and a two-step estimation method based on the backfitting algorithm. Furthermore, we introduce the SCAD penalty function into VICM and use the BIC criterion to select the optimal tuning
parameters, thereby achieving variable selection for the high-dimensional varying index coefficient model. Numerical simulations show that when the index variable influences the relationship between the independent and dependent variables in a more complicated way, and the dimensionality of the independent variable is relatively high, the SCAD-VICM model better recovers the true model and reduces the estimation error. In addition, based on the theory of the mode, we choose the optimal quantile levels and propose the weighted optimal quantile regression (MWQR) method. Compared with existing estimation methods, MWQR on the one hand removes the subjectivity of choosing quantile levels in traditional quantile regression; on the other hand, it gives the coefficient estimates at different quantiles proper weights, which greatly improves the efficiency of the estimates. We then apply MWQR to a classical semiparametric model, the partially linear additive model (PLAM), and propose a robust estimation procedure (MWQR-PLAM) and a variable selection procedure (PMWQR-PLAM) for it; the asymptotic properties of the estimator are also proved. Numerical simulations show the superiority of the proposed method in parameter estimation and variable selection, especially when the error terms are heavy-tailed. Finally, the proposed method is applied to a case study of the "implicit guarantee" of urban investment bonds and to a plasma β-carotene concentration problem, which further demonstrates its robustness and wide applicability.

Second, this dissertation considers a model-free ultra-high dimensional interaction screening method, VI, and proves theoretical properties of the algorithm, such as the sure screening property and the ranking consistency property. The VI method has the following advantages: ① it does not rely on any model structure, so it can be used to explore nonlinear
and complex interaction effects; ② it can capture category-specific and category-general interaction effects, so it applies to more practical scenarios; ③ by applying the slicing method, the VI algorithm can be extended from simple binary and multi-class classification problems to interaction screening with continuous responses or even count data. At the same time, the computation is very simple and requires no numerical optimization. In addition, this dissertation proposes a main effect screening algorithm, VM, which can be combined with the VI method to produce a two-step interaction screening procedure that greatly saves computing time. We also use the knockoff method to control the false discovery rate (FDR) in the main effect screening process. Numerical simulations and an empirical study on the SRBCT data further verify that the proposed VI algorithm has significant advantages over existing algorithms.

Third, high-dimensional compositional data with massive rounded zeros and missing values are arising with the fast development of computer technology and bring many challenges. Heavy tails and a complicated covariance structure make the analysis more difficult, so this dissertation focuses on robust methods for imputing rounded zeros in high-dimensional compositional data. To this end, as an extension of variable selection methodology, a robust method (SubLQR) based on a modified EM algorithm is proposed, combining R-type clustering and Lasso-quantile regression. The proposed SubLQR is superior to other methods in the following aspects: (1) Robustness: with the application of Lasso-quantile regression, a sparse pattern is provided; (2) Efficiency: with the use of R-type clustering, computation cost is reduced and precision is improved. Simulation results suggest that the proposed method performs better than existing methods, especially when the percentage of zeros is
large and outliers occur. The results also indicate that the proposed method greatly shortens the running time, especially under high-dimensional conditions. Finally, a real-data analysis in the rare metabonomics field indicates the wide applicability of the proposed SubLQR.
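
As a brief illustration of two building blocks named in the abstract, the sketch below implements the SCAD penalty and the quantile check loss in their standard textbook forms. It is not taken from the dissertation: the function names `scad_penalty` and `check_loss` are hypothetical, and the default `a = 3.7` is the conventional choice in the SCAD literature, assumed here rather than stated in the abstract.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """Standard SCAD penalty for one coefficient (assumed form; needs a > 2)."""
    if lam < 0 or a <= 2:
        raise ValueError("need lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the Lasso.
        return lam * t
    if t <= a * lam:
        # Middle range: quadratic piece that joins the two regimes smoothly.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so no further shrinkage.
    return lam * lam * (a + 1) / 2


def check_loss(u: float, tau: float) -> float:
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0}) for 0 < tau < 1."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    indicator = 1.0 if u < 0 else 0.0
    return u * (tau - indicator)
```

With `lam = 1.0`, `scad_penalty(0.5, 1.0)` equals 0.5, the same as a Lasso penalty, while the penalty stays constant once the coefficient exceeds `a * lam` in absolute value. A weighted quantile method of the kind described above would combine `check_loss` terms at several quantile levels; how the dissertation chooses those levels and weights is not specified in the abstract.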

Keywords/Search Tags:

High-dimensional and ultra-high dimensional data, Nonparametric and semiparametric models, Variable selection, Interaction screening, Compositional data

Related items

1	Robust Estimation And Variable Selection Of Two Kinds Of Semi-parametric Models Under High Dimension Data
2	Some Studies On Feature Screening Of Ultra-high-dimensional Longitudinal Data And Group Structured Data
3	Research On Feature Selection Method Without Model Constraints Under Ultra High Dimensional Data
4	Bayesian Variable Selection in Parametric and Semiparametric High Dimensional Survival Analysis
5	Variable Selection And Feature Screening In High-dimensional Data
6	Adaptive Variable Screening For Ultra-High Dimensional Heterogeneous Data
7	Structure Identification, Variable Selection And Robust Estimation For Some Semiparametric Models With High Dimensional Complicated Data
8	Methods And Theories For Semiparametric Regression Models With Complex Data
9	Research On Variable Screening Of Ultra-high Dimensional Categorical Data Based On Relative Entropy
10	Empirical Likelihood Inference For High-Dimensional Data With A Diverging Number Of Parameters