Robust Variable Selection For Constrained High-dimensional Model And Classification Under Distribution Heterogeneity

Posted on: 2020-04-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y X Liu    Full Text: PDF
GTID: 1360330602956802    Subject: Probability theory and mathematical statistics

Abstract/Summary:
Variable selection and classification are hot topics in statistical analysis and machine learning, with wide applications in scientific research and practice, such as medical diagnosis, genomic research, financial risk, and wireless communications. High-dimensional models are usually assumed to be sparse, so that only a few predictors contribute to the response variable. Variable selection is designed to select the important predictors and estimate the corresponding coefficients. Classification methods solve the problem of identifying to which of a set of categories a new observation belongs, based on a training set of observations whose category membership is already known. Although many variable selection and classification methods are available today, they are not applicable, or are ineffective, for data with heavy-tailed errors, large outliers, or heterogeneous distributions. In addition, practical settings often provide assumptions or subject-matter information about the relationship between the response and the predictor variables, which can be incorporated as constraints on the parameters. In this thesis, we investigate two types of problems. The first concerns robust variable selection for heavy-tailed data under linear constraints on the parameters; the second concerns classification for data that are heterogeneous within one class.

The thesis is divided into five chapters. Chapter One briefly introduces the variable selection method Lasso and its variants, degrees of freedom, quantile regression, Huber regression, and traditional classification methods. Chapters Two and Three are devoted to robust variable selection for constrained high-dimensional models: Chapter Two presents a generalized $l_1$-penalized quantile regression with linear constraints, and Chapter Three proposes a regularized regression with the Huber loss and linear constraints. Chapter Four studies a classification method with minimum ambiguity for data that are heterogeneous within one class. Chapter Five contains conclusions and further discussion. We now introduce the main parts of the thesis.

Chapter Two: We study penalized quantile regression with constraints on the parameters in high-dimensional models. Quantile regression (QR) estimates the conditional quantile function of the response and can therefore provide a comprehensive picture of how the response depends on the predictors. The quantile loss is less sensitive to extreme observations, so penalized QR is more robust than the Lasso. In some practical applications, linear equality or inequality constraints on the parameters can be formulated from prior knowledge, which can further improve variable selection and estimation. Examples of linearly constrained lasso problems include recurrent neural networks (Xia and Wang, 2005), portfolio selection (Fan et al., 2012), and shape-restricted nonparametric regression (Wang and Ghosh, 2012). For observations $\{(x_i, y_i), i = 1, \dots, n\}$, where $x_i \in \mathbb{R}^p$ is a vector of predictors and $y_i \in \mathbb{R}$ is the response, we propose the linearly constrained regularized quantile regression
$$\min_{\beta_0,\beta}\ \sum_{i=1}^{n}\rho_\tau\!\left(y_i-\beta_0-\beta^{T}x_i\right)+\lambda\|D\beta\|_1 \quad \text{subject to} \quad C\beta\le d,\ E\beta=f, \qquad (1)$$
where $\rho_\tau(\cdot)$ is the quantile loss function, $\lambda\ge 0$ is a tuning parameter, $\|\cdot\|_1$ is the $l_1$-norm of a vector, and $D\in\mathbb{R}^{m\times p}$, $C\in\mathbb{R}^{q\times p}$, $d\in\mathbb{R}^{q}$, $E\in\mathbb{R}^{s\times p}$, and $f\in\mathbb{R}^{s}$ are constant matrices or vectors specified by the user according to assumptions or subject-matter knowledge in the application. The usual lasso (Tibshirani, 1996), the fused lasso (Tibshirani et al., 2005), and the adaptive lasso (Zou, 2006) are special cases of the above problem under proper choices of $D$.
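To make the setup concrete, here is a minimal sketch that solves problem (1) at a single value of $\lambda$ with a generic convex solver. This is not the thesis's path algorithm; the synthetic data and the choices of $D$, $C$, $d$, $E$, $f$ are illustrative placeholders only.

```python
# Sketch of problem (1): linearly constrained l1-penalized quantile
# regression at one lambda, via cvxpy. The thesis instead computes the
# entire solution path; this only checks a single point on it.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + 2 * X[:, 1] + rng.standard_t(df=2, size=n)  # heavy-tailed errors

tau, lam = 0.5, 1.0
D = np.eye(p)                            # usual lasso as a special case of ||D beta||_1
C, d = -np.eye(p), np.zeros(p)           # example inequality constraint: beta >= 0
E, f = np.ones((1, p)), np.array([1.0])  # example equality constraint: sum(beta) = 1

b0 = cp.Variable()
beta = cp.Variable(p)
r = y - b0 - X @ beta
qloss = cp.sum(cp.maximum(tau * r, (tau - 1) * r))   # check (quantile) loss
objective = cp.Minimize(qloss + lam * cp.norm(D @ beta, 1))
constraints = [C @ beta <= d, E @ beta == f]
cp.Problem(objective, constraints).solve()
print("beta_hat =", np.round(beta.value, 3))
```

The equality constraint above mimics the portfolio-selection setting (weights summing to one) mentioned as a motivating example.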
We derive the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem and define the index sets
$$\mathcal{E}=\{i: y_i-\beta_0-\beta^{T}x_i=0\},\quad \mathcal{A}=\{k: D_k\beta\neq 0\},\quad \mathcal{B}=\{j: C_j\beta=d_j\}.$$
The solution $(\hat\beta_0,\hat\beta)$ clearly depends on the value of $\lambda$. There exist transition points $\lambda_k$ at which the solution jumps, and the solution is piecewise constant between any two of them. We propose an efficient algorithm to compute the whole solution path $\{(\hat\beta_0(\lambda),\hat\beta(\lambda)):\ 0\le\lambda\le\infty\}$ by locating each transition point and the corresponding solution. According to Stein's lemma, we derive the degrees of freedom of the fit $\hat{y}=X\hat\beta+\hat\beta_0\mathbf{1}$ as $df(\hat{y})=E(|\mathcal{E}|)$. Two model selection criteria, SIC and GACV, are constructed from this formula to select the optimal $\lambda$. Simulation studies and real-data examples illustrate the effectiveness of the proposed algorithm and degrees of freedom.

Chapter Three: Data subject to heavy-tailed errors and outliers are commonly encountered in many scientific fields and practical applications. In this case, the ordinary least squares estimator is known to be inefficient. Moreover, when the error distribution is asymmetric, quantile regression cannot recover the mean regression function. To overcome these problems, Huber (1981) describes a robust loss function that is quadratic for small values but grows linearly for large values. We propose a regularized Huber regression with linear constraints on the parameters for high-dimensional models:
$$\min_{\beta}\ \sum_{i=1}^{n}H_M\!\left(y_i-\beta^{T}x_i\right)+\lambda\|D\beta\|_1 \quad \text{subject to} \quad C\beta\le d,\ E\beta=f, \qquad (2)$$
where $H_M(\cdot)$ is the Huber loss function and $\lambda$, $D$, $C$, $d$, $E$, $f$ are defined as in problem (1). Choosing the optimal tuning parameter $\lambda$ is critical in high-dimensional models, and a formula for the degrees of freedom is needed to construct a model selection criterion for $\lambda$. As far as we know, there is no existing research on the degrees of freedom of regularized Huber-loss regression. Using the Moreau-Yosida regularization of the absolute value function (Hiriart-Urruty and Lemaréchal, 1991), problem (2) is equivalent to
$$\min_{\beta,v}\ \frac{1}{2}\|y-X\beta-v\|_2^2+M\|v\|_1+\lambda\|D\beta\|_1 \quad \text{subject to} \quad C\beta\le d,\ E\beta=f,$$
where $y=(y_1,y_2,\dots,y_n)^{T}$ is the vector of responses, $X=(x_1,x_2,\dots,x_n)^{T}$ is the design matrix, and $v=(v_1,\dots,v_n)^{T}$ is an auxiliary variable with the same dimension as $y$. Define the index sets
$$\mathcal{V}=\{i: v_i\neq 0,\ |y_i-\beta^{T}x_i|>M\},\quad \mathcal{A}=\{k: D_k\beta\neq 0\},\quad \mathcal{B}=\{j: C_j\beta=d_j\}.$$
We derive the KKT conditions for this optimization problem. It can be shown that $\hat\beta$ is directly affected only by the data in $\mathcal{V}^{c}$, whose residuals fall in $[-M,M]$, while the data in $\mathcal{V}$ affect $\hat\beta$ only through the signs $s_{\mathcal{V}}$. This explains why Huber-loss regression is robust to heavy-tailed errors and outliers. An explicit expression of the fit is obtained from the KKT conditions to compute the degrees of freedom. Denoting $G_{-\mathcal{A},\mathcal{B}}=(D_{-\mathcal{A}}^{T},C_{\mathcal{B}}^{T},E^{T})^{T}$, the degrees of freedom of the Huber fit $\hat{y}=X\hat\beta$ are
$$df(\hat{y})=E\!\left[\dim\!\left(\mathrm{col}\!\left(X_{-\mathcal{V}}\,P_{\mathrm{null}(G_{-\mathcal{A},\mathcal{B}})}\right)\right)\right],$$
where $P_{\mathrm{null}(G_{-\mathcal{A},\mathcal{B}})}$ is the projection matrix onto $\mathrm{null}(G_{-\mathcal{A},\mathcal{B}})$.
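To illustrate the reformulation, the following self-contained sketch solves the equivalent joint problem in $(\beta, v)$ with a generic convex solver. Again the data and the matrices $D$, $C$, $d$, $E$, $f$ are synthetic placeholders, and this is not the thesis's own algorithm.

```python
# Moreau-Yosida form of problem (2): minimize over (beta, v) jointly, where
# v_i absorbs residuals larger than M in absolute value.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + 2 * X[:, 1] + rng.standard_normal(n)
y[:5] += 20                              # inject large outliers

M, lam = 1.345, 1.0
D = np.eye(p)                            # lasso-type penalty
C, d = -np.eye(p), np.zeros(p)           # beta >= 0
E, f = np.ones((1, p)), np.array([1.0])  # sum(beta) = 1

beta = cp.Variable(p)
v = cp.Variable(n)                       # Moreau-Yosida auxiliary variable
obj = cp.Minimize(0.5 * cp.sum_squares(y - X @ beta - v)
                  + M * cp.norm(v, 1) + lam * cp.norm(D @ beta, 1))
cp.Problem(obj, [C @ beta <= d, E @ beta == f]).solve()

# At the optimum, v_i is the soft-thresholded residual, so v_i != 0 exactly
# when |y_i - x_i^T beta| > M; such points enter the fit only through
# sign(v_i), which is the mechanism behind the robustness of the Huber fit.
print("nonzero v (set V):", np.sum(np.abs(v.value) > 1e-6))
```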
Simulation studies show that the model selection criterion based on the proposed degrees of freedom performs similarly to the standard criterion. A real-data example illustrates the robustness of the Huber loss function when there are outliers in the response.

Chapter Four: Traditional classification methods rest on the assumption that the distribution of the predictor $X$ within each class is homogeneous. However, heterogeneous data arise in many application fields, such as medicine, biological science, gene expression, and finance. Distribution heterogeneity within a class arises when one class comprises different situations; for example, the class of people with anemia consists of people with a sexually transmitted disease (STD) (situation 1) and people without (situation 2). Since an STD can affect a person's physical condition, it causes heterogeneity in the distribution of $X$ within the class. Lei (2015) proposed a classification method with minimum ambiguity, using the likelihood ratio of the two classes to construct the classifier and the thresholds. When the data in one class are heterogeneous, however, the likelihood ratio of the two classes is no longer unique. We overcome this difficulty and generalize the method of Lei (2015) to the case of distribution heterogeneity in one class.

For convenience, we assume there are two different situations in one class, denoted $G_1$ and $G_2$. The error rates $\alpha_{kj}$ in each situation $k=1,2$ of each class $j=0,1$ are specified by the user. Denote the two classification regions by $C_0$ and $C_1$. We aim to minimize the ambiguity $P(C_0\cap C_1)$ under the constraint of high accuracy in each situation of each class. We propose a two-stage procedure to solve this new classification problem. First, we derive the thresholds $t_{k0}$, $t_{k1}$ in each situation $k=1,2$ by the method of Lei (2015), since the distribution of $X$ within each situation is homogeneous. Then we choose the new thresholds $t_0$ and $t_1$ as
$$t_0=t_{10}(\alpha_{10})\vee t_{20}(\alpha_{20}),\qquad t_1=t_{11}(\alpha_{11})\wedge t_{21}(\alpha_{21}).$$
To make full use of all the information, the classifier is built from the likelihood ratio of the two classes over all situations, $\eta(x)=f_1(x)/f_0(x)$, where $f_j$ is the conditional density of $X$ given $Y=j$, for $j=0,1$. The classification regions are then
$$C_0=\{x:\eta(x)\le t_0\}\quad\text{and}\quad C_1=\{x:\eta(x)\ge t_1\}.$$
We show that the classification accuracy of each class is at least the weighted sum of the accuracies in its situations,
$$P_j(C_j)\ge \pi_{1j}(1-\alpha_{1j})+\pi_{2j}(1-\alpha_{2j}),\quad j=0,1,$$
where $\pi_{kj}=P\{X\in G_k\mid Y=j\}$ and $P_j$ is the conditional distribution of $X$ given $Y=j$. Hence the proposed classification regions ensure high accuracy in each class with minimum ambiguity under distribution heterogeneity. The unknown densities are estimated by the nonparametric kernel method, from which the thresholds and the classifier are estimated. We bound the error between the estimates $\hat{P}_j(\hat{C}_j)$ and the true values $P_j(C_j)$. Several simulation studies and an analysis of AIDS data examine the performance of the proposed classification method.
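The following is a hedged, self-contained sketch of the two-stage procedure on synthetic one-dimensional data, with kernel density estimates standing in for $f_0$ and $f_1$. The per-situation thresholds are taken as empirical quantiles of the estimated likelihood ratio, in the spirit of Lei (2015); the exact construction in the thesis may differ, and all data and error rates are placeholders.

```python
# Two-stage minimum-ambiguity classification under heterogeneity in class 0.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x0_g1 = rng.normal(-2.0, 1.0, 300)   # situation G1 within class 0
x0_g2 = rng.normal(1.0, 1.0, 300)    # situation G2 within class 0
x1 = rng.normal(3.0, 1.0, 600)       # class 1 (homogeneous in this toy example)

f0 = gaussian_kde(np.concatenate([x0_g1, x0_g2]))  # estimate of f_0
f1 = gaussian_kde(x1)                              # estimate of f_1
def eta(x):                                        # likelihood ratio f1/f0
    return f1(x) / f0(x)

alpha = 0.05                               # placeholder error rates alpha_kj
# Stage 1: per-situation thresholds as empirical quantiles of eta.
t10 = np.quantile(eta(x0_g1), 1 - alpha)   # t_{10}(alpha_10)
t20 = np.quantile(eta(x0_g2), 1 - alpha)   # t_{20}(alpha_20)
t1 = np.quantile(eta(x1), alpha)           # class 1 has one situation here;
                                           # with two, take the minimum (^)
# Stage 2: combine so every situation keeps its accuracy guarantee.
t0 = max(t10, t20)                         # t_0 = t_{10} v t_{20}

x_new = np.array([-2.0, 1.5, 3.5])
e = eta(x_new)
in_C0, in_C1 = e <= t0, e >= t1
# A point in both regions is ambiguous; a point in neither gets no label.
for x, c0, c1 in zip(x_new, in_C0, in_C1):
    print(f"x={x:5.2f}  in C0: {c0}  in C1: {c1}")
```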
Keywords/Search Tags: variable selection, linear constraints, robustness, generalized lasso, quantile regression, Huber regression, degrees of freedom, classification, distribution heterogeneity