
Research on Several Feature Screening Problems under Ultrahigh Dimensional Linear Models

Posted on: 2020-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Zhang
Full Text: PDF
GTID: 1360330602954659
Subject: Financial mathematics and financial engineering
Abstract/Summary:
During the past decades, the cost of data collection and storage has decreased significantly thanks to the rapid development of information technology. Scientists are therefore confronted with unprecedentedly massive data in various scientific fields, such as genomics, economics, signal and image processing, and the earth sciences. How to extract key information from complex high dimensional data under the disturbance of a large amount of redundant data has become a great challenge for statisticians. In high dimensional regression, this problem can be described as how to effectively and efficiently identify all the active predictors (predictors with nonzero coefficients) under the assumption that most predictors are inactive (with zero coefficients). However, for high dimensional regression models where the number of predictors p far exceeds the sample size n, many classical statistical methods, such as ordinary least squares and maximum likelihood estimation, fail to work due to the rapid expansion of the predictor dimension. To deal with high dimensional data, statisticians have carried out extensive research on dimension reduction techniques over the last twenty years, which fall into two main categories: variable selection methods based on penalized loss functions, and feature screening methods that aim to reduce the predictor dimension effectively. By solving optimization problems, variable selection methods can conduct parameter estimation and dimension reduction simultaneously. Nevertheless, an exponentially growing predictor dimension can significantly increase the computational cost of solving such optimization problems and lead to the inconsistency of many variable selection techniques. To further enhance the accuracy and efficiency of dimension reduction, statisticians began research on feature screening techniques, which aim to reduce the predictor dimension to a manageable size so that variable selection methods can be implemented smoothly afterwards.

In this thesis, we focus on feature screening problems under ultrahigh dimensional linear models and make the following contributions. First, we polish the theory of sure independence screening introduced by Fan & Lv (2008) by proving the sure screening properties of several types of iterative screening methods built on it. Then we introduce a new conditional feature screening method that exploits prior information about certain active predictors to further improve screening accuracy. Finally, inspired by the classical forward regression (FR) method, we propose a new iterative feature screening method based on the conditional screening method introduced previously, which not only works effectively with the help of prior information, but also performs well with a data-driven conditioning set when no prior knowledge is available. Our research can be summarized as follows.

1. There has been an explosion in the development of feature screening methods since Fan & Lv (2008) proposed the path-breaking SIS (sure independence screening) method, which has been widely applied in various fields. The success of SIS can be attributed to two facts. One is its relatively low computational cost compared to solving large-scale optimization problems. Most importantly, under certain assumptions, SIS preserves all active predictors with overwhelming probability, which is referred to as the sure screening property.
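SIS is simple enough to state in a few lines: standardize the predictors, rank them by absolute marginal correlation with the response, and keep the top d, with d = ⌊n/log n⌋ being the choice suggested by Fan & Lv (2008). The following is a minimal numpy sketch of that recipe, not code from the thesis; the function name `sis` and the default handling of d are our own.

```python
import numpy as np

def sis(X, y, d=None):
    """Sure independence screening (Fan & Lv, 2008): rank predictors
    by absolute marginal correlation with the response."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                     # screening size suggested by Fan & Lv
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each column
    ys = y - y.mean()
    omega = np.abs(Xs.T @ ys) / n                  # marginal correlations (up to scale)
    return np.argsort(omega)[::-1][:d]             # indices of the d largest scores
```

Each predictor is scored independently of the others, which is exactly why the method is cheap and exactly why correlations among predictors can break it, as discussed next.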
However, Fan & Lv (2008) showed that the sure screening property of SIS depends on the marginal correlation assumption, which requires the marginal correlations between the active predictors and the response to be bounded away from zero. Due to correlations among predictors, this assumption is easily violated in high dimensional scenarios, resulting in the underperformance of SIS in such situations. To overcome this problem, Fan & Lv (2008) proposed the ISIS (iterative sure independence screening) algorithm, which iterates SIS on residual vectors to diminish the influence of correlations among predictors. Fan et al. (2009) further improved ISIS and proposed Van-ISIS (vanilla ISIS) for generalized linear models based on marginal loss functions.

Even though ISIS and Van-ISIS perform outstandingly in both simulation studies and real data analyses, their sure screening properties had not been theoretically verified in the decade since. The sure screening property is a main consideration when evaluating a feature screening method, since it ensures that the subsequent variable selection procedure is carried out on a model that includes all active predictors. In the first part of this thesis, we prove the sure screening properties of three types of iterative screening algorithms under reasonable assumptions; ISIS and Van-ISIS can be regarded as two special cases, so their sure screening properties follow directly from these results. Our work fills a theoretical gap that has existed for more than ten years and provides theoretical support for the wide application of ISIS and Van-ISIS in the future. It is also noteworthy that our results yield a sharper result on the sure screening property of FR, under more general assumptions, than that in Wang (2009).

2. In scientific research, investigators can frequently obtain prior knowledge about certain active predictors from previous studies. How to use such prior information to enhance screening accuracy is an extremely interesting and valuable topic. To handle such prior information, Barut et al. (2016) proposed the CSIS (conditional sure independence screening) method to identify the remaining active predictors given the known ones. CSIS selects predictors by ranking the conditional contributions of the remaining predictors to the response. Barut et al. (2016) proved the sure screening property of CSIS under the conditional linear covariance assumption, which requires the conditional linear covariances between the remaining active predictors and the response, conditioning on the known active predictors, to be bounded away from zero. Nevertheless, similar to the marginal correlation assumption, the conditional linear covariance assumption can also be violated by high correlations among predictors, and CSIS may break down as a consequence. To improve on this situation, building on the HOLP (high dimensional ordinary least squares projection) method introduced by Wang & Leng (2016), we propose a new conditional screening method named conditional screening via ordinary least squares projection (COLP). HOLP is an efficient feature screening method designed for linear models; it conducts screening by ranking the estimates constructed from the Moore-Penrose inverse of the design matrix. However, HOLP is not able to utilize any prior information, and its sure screening property relies on an upper bound on ||β||, where β denotes the coefficient vector and ||β|| denotes its L2 norm; equivalently, when ||β|| is sufficiently large, the sure screening property of HOLP may no longer hold.
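For concreteness, here is a minimal numpy sketch of the HOLP screener, following the formula β̂ = X^T (X X^T)^{-1} y of Wang & Leng (2016) for p > n. The function name and the default screening size are our own choices, and we assume the n rows of X are linearly independent so that X X^T is invertible.

```python
import numpy as np

def holp(X, y, d=None):
    """HOLP (Wang & Leng, 2016): rank predictors by the minimum-norm
    OLS solution beta = X^T (X X^T)^{-1} y, applicable when p > n."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    # Moore-Penrose (minimum-norm) least squares solution for p > n;
    # assumes the n rows of X are linearly independent.
    beta = X.T @ np.linalg.solve(X @ X.T, y)
    return np.argsort(np.abs(beta))[::-1][:d]
```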
COLP first projects the design matrix onto the orthogonal complement of the column space of the conditioning active predictors, and then estimates the remaining coefficients with a diagonally dominant matrix constructed from the Moore-Penrose inverse of the projected design matrix. As a result, COLP eliminates the negative impact of the coefficients of the conditioning active predictors on the estimation of the remaining ones, and thus significantly improves the accuracy of estimation. We prove the sure screening property of COLP without relying on the conditional linear covariance assumption or on a restriction on the upper bound of ||β||. Therefore, COLP preserves all the remaining active predictors with overwhelming probability, even when some remaining active predictors have close-to-zero conditional linear covariance with the response, or when the coefficients of some known active predictors have large absolute values. Through comparisons with other screening methods, we demonstrate the effectiveness of COLP in extensive numerical studies.

3. The simulation studies show that COLP achieves its best performance when the conditioning set includes all the significant active predictors (active predictors whose coefficients have large absolute values). In real-world applications, however, it is usually impossible for researchers to obtain such informative prior knowledge. Consequently, even though COLP eliminates the impact of the coefficients of known active predictors, large coefficients of unidentified active predictors can still influence the estimation of the remaining coefficients. To further avoid the influence of hidden active predictors, we propose a new iterative feature screening method named forward screening via ordinary least squares projection (FOLP). By iterating COLP, FOLP progressively eliminates the possible impact of the coefficients of selected predictors. Similar to FR, FOLP adds new predictors to the selected model one by one by comparing residual sums of squares (RSS); FOLP is more computationally efficient, since it considers only two candidate models in each step, whereas FR has to evaluate the RSS of all remaining predictors. Most importantly, FOLP also works effectively with a data-driven conditioning set when no prior information is available. Compared to COLP, FOLP further enhances screening accuracy when the prior information fails to include all the significant active predictors, and it performs outstandingly in extensive numerical studies regardless of the availability of prior information. Moreover, the effectiveness of FOLP is also examined in the analysis of a leukemia dataset, where FOLP achieves zero training error and zero testing error in combination with the naive Bayes rule.
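To make the two procedures concrete, below is a hedged numpy sketch of one plausible reading of COLP and FOLP as described above. The projection step and the Moore-Penrose estimate follow the description directly; the screening size, the behavior with an empty conditioning set, and in particular the RSS-based acceptance rule with the tuning constant `tol` in `folp` are placeholder choices of ours, not the thesis's actual criteria.

```python
import numpy as np

def colp(X, y, cond, d=None):
    """One reading of COLP: project out the conditioning predictors,
    then screen the remaining ones HOLP-style on the projected data."""
    n, p = X.shape
    cond = np.asarray(cond, dtype=int)
    rest = np.setdiff1d(np.arange(p), cond)
    Xc = X[:, cond]
    # Projector onto the orthogonal complement of span(Xc); with an
    # empty conditioning set this reduces to the identity (plain HOLP).
    Q = np.eye(n) - Xc @ np.linalg.pinv(Xc.T @ Xc) @ Xc.T
    Xt, yt = Q @ X[:, rest], Q @ y
    # Moore-Penrose estimate of the remaining coefficients.
    beta = Xt.T @ np.linalg.pinv(Xt @ Xt.T) @ yt
    if d is None:
        d = int(n / np.log(n))
    return rest[np.argsort(np.abs(beta))[::-1]][:d]

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns."""
    if len(cols) == 0:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ coef
    return float(r @ r)

def folp(X, y, cond=(), max_size=None, tol=0.01):
    """One reading of FOLP: repeatedly run COLP conditioned on the
    current set and keep its top-ranked candidate only if the RSS
    improves enough -- two candidate models per step, as in the text.
    The relative-improvement rule with `tol` is a placeholder."""
    n = X.shape[0]
    if max_size is None:
        max_size = int(n / np.log(n))
    selected = list(cond)
    while len(selected) < max_size:
        j = int(colp(X, y, selected, d=1)[0])   # top-ranked remaining predictor
        if rss(X, y, selected + [j]) < (1 - tol) * rss(X, y, selected):
            selected.append(j)
        else:
            break
    return selected
```

Called with `cond=()`, the first COLP pass reduces to plain HOLP, which mirrors the data-driven conditioning behavior described above when no prior information is available.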
Keywords/Search Tags: Variable selection, Iterative feature screening, Sure screening property, Conditional feature screening, Forward regression