
The Parallel Computation Of Expectile Model Under Massive Data

Posted on: 2021-02-03    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S S Song    Full Text: PDF
GTID: 1368330632953435    Subject: Mathematical Statistics
Abstract/Summary:
With the development of science and technology, available data resources have exploded, giving rise to the concept of massive data. Massive data challenge traditional statistical analysis methods and computing tools, and have attracted wide attention from both academia and industry. An important feature of massive data is sheer volume: sample sizes can reach millions or even billions. The computing time, memory requirements, and information interaction needed to process data at this scale place extremely high demands on the performance of a single computer. To meet these challenges, three mainstream approaches have emerged recently, namely the subsampling method, the online updating algorithm, and the “divide and conquer” algorithm. Each has its own advantages, and each is well documented in the literature. The subsampling algorithm, as the name implies, extracts a small subsample from all the original data so that computation can be carried out quickly on a single machine; the waste of data information this entails is, however, inevitable. The online updating algorithm targets streaming data: when data are generated continuously, it offers the advantages of low memory usage and fast computation. The “divide and conquer” algorithm addresses general massive data. In its simplest form, it randomly partitions all the original data across several sub-machines, completes the corresponding computation on each sub-machine, returns the results to a common machine, and finally integrates them there to obtain the final estimate, with simple averaging being the most common integration method. The key advantages of the “divide and conquer” algorithm are fast computation and effective information interaction.

The three main parts of this thesis are all developed within the “divide and conquer” framework. With the increase in data volume, the
complex characteristics of the data, such as nonlinearity and heterogeneity, become more and more pronounced, so analysis tools based on the traditional mean regression model struggle to meet the needs of massive data analysis. As a classic model for characterizing the distributional relationship between variables, the expectile regression model has therefore become a focus of attention. The expectile model is chosen over the quantile model mainly because of its sensitivity to tail information: it accounts not only for tail probabilities but also for the extreme values themselves. This makes it better suited to studies centered on tail modeling, since the expectile model exploits more of the tail information in the available data. The first model we focus on is the linear expectile model. A common method for parameter estimation in linear models is asymmetric least squares (ALS) estimation, implemented as an iteratively reweighted least squares algorithm. Since the entire estimation process is iterative, directly applying the “divide and conquer” algorithm is clearly infeasible. The method proposed in this part is therefore to complete the ALS estimation on each sub-machine and return the estimates to the common machine; the remaining question is how to integrate these results into an efficient combined estimate. Based on the fact that the ALS estimates obtained on the sub-machines are asymptotically normal, and inspired by the confidence distribution method from meta-analysis, we establish the corresponding joint confidence density function to determine the final integration rule. Interestingly, the resulting integration rule is similar in form to those obtained in other literature, but is derived under different assumptions. The second focus of this thesis is still set in the linear model framework. Since the “divide and conquer” algorithm can only compute the
parameter estimates at a single expectile level, it cannot recover the entire parameter curve. We therefore propose a two-step projection algorithm. First, divide the support of the expectile level into several parts and take the corresponding equidistant grid points; then use the “divide and conquer” algorithm to compute the parameter estimates at all grid points. The resulting estimates are then treated as response variables, with the corresponding expectile levels as explanatory variables, and the B-spline method is applied to obtain a curve estimate for each dimension. Because the true expectile curve is increasing in the expectile level, we construct a constrained B-spline estimation procedure, which can be cast as a quadratic programming or linear programming problem. The third focus of this thesis is the varying-coefficient expectile model. Using kernel smoothing techniques, the parameter estimation problem for the varying-coefficient expectile model is transformed into a general form of the ALS estimator. However, the classic bandwidth selection method (cross-validation) involves multiple iterations and repeated partitions of all the original data, on top of the iteratively reweighted least squares algorithm inside the ALS estimator, so the whole process is very time-consuming. Exploiting the structure of the model, we use certain “summary statistics” to represent all the original data on each sub-machine, which greatly simplifies the computation while preserving effective information interaction.
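The parallel ALS scheme described above can be sketched as follows. This is an illustrative reconstruction, not the thesis code: the function names (`als_expectile`, `divide_and_conquer_expectile`) and all tuning parameters are our own, and the common machine here uses the simple average rather than the confidence-density-based combination derived in the thesis.

```python
import numpy as np

def als_expectile(X, y, tau=0.5, n_iter=50, tol=1e-8):
    """ALS estimate of a linear expectile model at level tau,
    computed by iteratively reweighted least squares (IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS starting value
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.where(resid >= 0, tau, 1 - tau)       # asymmetric weights
        WX = X * w[:, None]
        # Solve the weighted normal equations X' W X beta = X' W y
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

def divide_and_conquer_expectile(X, y, tau=0.5, n_machines=10, rng=None):
    """Randomly partition the data across 'sub-machines', fit ALS on each
    block, and integrate the sub-estimates on the common machine.
    Simple averaging is used here for illustration."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    blocks = np.array_split(idx, n_machines)
    betas = [als_expectile(X[b], y[b], tau) for b in blocks]
    return np.mean(betas, axis=0)
```

At level tau = 0.5 the asymmetric weights are constant, so ALS reduces to ordinary least squares; comparing the divide-and-conquer estimate with the full-data fit at that level gives a quick sanity check of the scheme.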
Keywords/Search Tags: Massive data, “Divide and conquer” algorithm, Linear expectile model, Confidence distribution, Two-step projection algorithm, Varying-coefficient expectile model, Summary statistics, Cross validation