
Research On Distributed Estimation Algorithms For Generalized Linear Models And Quantile Regression

Posted on: 2022-05-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Fan
Full Text: PDF
GTID: 1480306494970359
Subject: Mathematical statistics
Abstract/Summary:
With the rapid development of science and technology, the era of big data has arrived, and distributed storage is a common way of coping with its storage demands. For distributed large-sample data, traditional centralized estimation algorithms are usually computationally inflexible or even infeasible, because they require transmitting all the data subsets stored on local machines to a central machine before estimation can begin; this is especially problematic when the storage and computing capacity of the central machine is insufficient, or when the samples are highly confidential so that transmitting raw data is not allowed. Against this background, it is necessary to develop effective and computationally efficient distributed estimation algorithms for common regression models. This dissertation focuses on distributed algorithms for several common estimation problems in generalized linear models (GLMs) and linear quantile regression, both of which are widely applied in practice, and also discusses parallel computation of the weighted quantile regression estimator for longitudinal big data. The main contributions are as follows.

(1) Distributed computation of the adaptive lasso estimator in GLMs. In GLMs, the regularization paths of the adaptive lasso, and hence the optimal adaptive lasso estimator, are often computed with the R package "glmnet". Essentially, glmnet is a coordinate-descent-based centralized algorithm; for distributed big data it is usually computationally inflexible and relatively slow. In Chapter 3 we propose a distributed adaptive lasso estimation method, QAGLM-alasso, built on a distributed quadratic approximation representation of GLMs, and develop a path-following algorithm, QAGLM-LARS, for the QAGLM-alasso estimator based on least angle regression (LARS). Theoretical analysis shows that, under mild regularity conditions, the QAGLM-alasso estimator is asymptotically equivalent to the original adaptive lasso estimator. Simulation and real-data analyses demonstrate that QAGLM-LARS attains model selection and estimation accuracy similar to the classic glmnet, while outperforming it in computational efficiency in distributed environments.

(2) Distributed computation of nonconvex penalized estimators in GLMs. Nonconvex penalized GLMs are a common tool for analyzing high-dimensional sparse data with non-normal responses and/or nonlinear response-covariate relationships. The estimation problem under the nonconvex penalties SCAD and MCP is usually solved with the R package "ncvreg". Like glmnet, ncvreg is a coordinate-descent-based centralized algorithm, and for distributed big data it suffers from the same inflexibility and slow estimation speed. In Chapter 4 we propose a distributed computing method, QAGLM-NC, for nonconvex penalized estimators, again using the distributed quadratic approximation representation of GLMs, and develop a parallel algorithm, QAGLM-ADMM, for computing the QAGLM-NC estimator based on the alternating direction method of multipliers (ADMM). Under the common nonconvex penalties SCAD and MCP, all the ADMM update problems in this parallel algorithm have closed-form solutions. Theoretical analysis shows that, under mild regularity conditions, the objective optimized by QAGLM-NC has a consistent local minimum point; this local minimum point enjoys the oracle property and is asymptotically equivalent to the consistent local minimum point of the original nonconvex penalized objective. Simulation and real-data analyses demonstrate that, in distributed environments, QAGLM-ADMM matches the classic ncvreg in model selection and estimation accuracy but is usually faster.

(3) Parallel computation of nonconvex penalized linear quantile regression estimators. For small or moderate sample sizes, the estimation problem of nonconvex penalized linear quantile regression is well handled by the QICD algorithm, whose advantage is high estimation accuracy; however, QICD is essentially a coordinate-descent-based double-loop algorithm and may therefore be slow on big samples. The more recent parallel algorithm QPADM, proposed for big samples and based on ADMM, clearly improves on QICD in computational efficiency while retaining favorable estimation accuracy. Its deficiency is a relatively slow convergence rate: it usually needs several hundred iterations to converge, which is a disadvantage in distributed environments with expensive communication. In Chapter 5 we develop a new parallel algorithm, QPADM-slack, for computing the nonconvex penalized linear quantile regression estimator within the ADMM framework, by introducing suitable auxiliary variables, among them two sets of slack variables that convert the nonsmooth check loss of the original estimation problem into a linear function. For the common nonconvex penalties SCAD and MCP, all the ADMM update problems in QPADM-slack have closed-form solutions. Simulation and real-data analyses demonstrate that QPADM-slack performs similarly to QPADM in model selection and estimation accuracy in both non-distributed and distributed environments, and converges faster.

(4) Parallel computation of the weighted quantile regression estimator for longitudinal big data. Longitudinal data are usually massive and high-dimensional, and observations from the same subject are correlated; these properties pose further challenges for the analysis and computation of quantile regression. For longitudinal data, traditional linear quantile regression often has relatively low estimation efficiency because it completely ignores the dependence in the data. Weighted quantile regression can improve the estimation efficiency by adding to the model a set of weights that are informative about the within-subject correlations. In Chapter 6 we employ weighted quantile regression to model longitudinal data and develop a two-stage parallel method for its estimation with distributed longitudinal big data. In the first stage, we give a distributed computing method for estimating the weights using the Newton-Raphson algorithm; in the second stage, we develop a parallel algorithm, WQR-ADMM, based on the ADMM framework, for solving the weighted quantile regression estimation problem with known weights. Simulation and real-data analyses show that the proposed parallel method attains estimation accuracy similar to the traditional interior-point-based centralized algorithm in both non-distributed and distributed environments, with a clear advantage in computational efficiency.
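The distributed quadratic approximation idea behind the GLM chapters can be illustrated with a minimal sketch: for a logistic GLM, each machine summarizes its local shard by the weighted Gram matrix and working response of one iteratively reweighted least squares (IRLS) step, and only these p-by-p and p-by-1 summaries travel to the central machine. The function names and two-shard setup below are hypothetical illustrations, not the dissertation's code.

```python
import numpy as np

def local_summaries(X, y, beta):
    """Per-machine sufficient statistics for one IRLS step of a logistic GLM."""
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
    w = mu * (1.0 - mu)                  # IRLS weights
    z = eta + (y - mu) / w               # working response
    return X.T @ (w[:, None] * X), X.T @ (w * z)

def distributed_irls_step(shards, beta):
    """One IRLS update from aggregated summaries; raw data stays on the local machines."""
    p = beta.shape[0]
    H, g = np.zeros((p, p)), np.zeros(p)
    for X_k, y_k in shards:
        H_k, g_k = local_summaries(X_k, y_k, beta)
        H += H_k                         # aggregate p x p matrices
        g += g_k                         # aggregate p-vectors
    return np.linalg.solve(H, g)         # solve on summaries only
```

Because the quadratic surrogate is exact in these summaries, the aggregated update coincides with the centralized IRLS update on the pooled data; penalties such as the adaptive lasso, SCAD, or MCP are then applied to this surrogate rather than to the raw likelihood.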
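The closed-form ADMM updates mentioned for SCAD and MCP come from the fact that the proximal operators of these penalties have explicit expressions. A minimal sketch of the standard textbook forms (unit step size; not the dissertation's implementation):

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding, the proximal operator of the lasso penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_scad(z, lam, a=3.7):
    """Closed-form proximal operator of the SCAD penalty (requires a > 2)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * lam,
                    soft(z, lam),
                    np.where(np.abs(z) <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

def prox_mcp(z, lam, gamma=3.0):
    """Closed-form proximal operator of the MCP penalty (requires gamma > 1)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= gamma * lam,
                    soft(z, lam) * gamma / (gamma - 1),
                    z)
```

Sufficiently large inputs pass through unchanged (|z| > a*lam for SCAD, |z| > gamma*lam for MCP), which is the source of the oracle-type behavior: unlike the lasso, these penalties do not shrink large signals.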
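The slack-variable device described for QPADM-slack rests on a standard identity: the check loss rho_tau(r) = r * (tau - 1{r < 0}) equals tau*u + (1 - tau)*v when the residual is split as r = u - v with u, v >= 0 and u*v = 0. A small numerical sketch of this identity (illustrative code, not the QPADM-slack algorithm itself):

```python
import numpy as np

def check_loss(r, tau):
    """Quantile regression check loss rho_tau(r) = r * (tau - 1{r < 0})."""
    r = np.asarray(r, dtype=float)
    return r * (tau - (r < 0))

def slack_split(r):
    """Split residuals into nonnegative slack variables u, v with r = u - v."""
    r = np.asarray(r, dtype=float)
    return np.maximum(r, 0.0), np.maximum(-r, 0.0)

def linearized_loss(r, tau):
    """The check loss rewritten as the linear function tau*u + (1 - tau)*v."""
    u, v = slack_split(r)
    return tau * u + (1 - tau) * v
```

Because the rewritten objective is linear in (u, v), the nonsmooth check loss no longer appears in the ADMM subproblems, which is what lets every update have a closed form.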
Keywords/Search Tags:Distributed big data, Generalized linear models, Quantile regression, Parallel algorithm, ADMM