
Research On Algorithm Of Generalized Linear Model Under Massive Data

Posted on: 2021-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: S D Chen
Full Text: PDF
GTID: 2480306602476764
Subject: Mathematics
Abstract/Summary:
The generalized linear model is a direct generalization of the linear regression model. It allows the dependent variable to follow a distribution from the exponential family, so it applies to a wider range of scenarios than linear regression. To apply the model in practice, parameter estimation and variable selection have become two important research directions for generalized linear models. However, previous research has mainly focused on small samples. With the rapid development of data collection technology, massive data has become a research trend. Massive data offers opportunities, but it also poses challenges: compared with the small-sample case, the difficulty lies in completing the computation at all on hardware with limited resources, and in completing it efficiently. The divide and conquer algorithm is an effective way to address these difficulties and has gradually been applied to statistical analysis. However, research in this area is still at a preliminary stage, and parameter estimation and variable selection for complex models under massive data need further study.

The first part of this thesis studies parameter estimation for the generalized linear model under massive data. To avoid using the full set of observations in every iteration, it combines the Newton-Raphson algorithm with the divide and conquer algorithm and proposes an aggregate quasi-likelihood estimation algorithm that can estimate the model parameters in both single-machine and distributed modes. By dividing the data into blocks, the aggregation algorithm lowers the hardware requirements. In terms of asymptotic properties, it is proved that when the relationship between the number of blocks and the total number of observations satisfies certain conditions, the aggregate estimator produced by this algorithm is asymptotically equivalent to the estimator computed directly from the full data. Numerical simulations are carried out in both a stand-alone environment and a cluster environment. The results show that the aggregate estimation algorithm reduces memory overhead and is suitable for a Spark distributed cluster environment.

The second part of this thesis studies variable selection for the generalized linear model under massive data. First, at ordinary data scale, the variable selection method for the logistic model with the SCAD and MCP penalties is extended to generalized linear models with a general link function. Second, this method is combined with the divide and conquer algorithm to make it suitable for massive data. The algorithm is simulated numerically in both single-machine and cluster environments. The results show that divide and conquer reduces memory overhead and improves computational efficiency, and that distributed parallel execution improves efficiency further. These results verify that the algorithm is feasible and effective in practical applications.
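The block-wise estimation idea from the first part can be illustrated with a minimal sketch. The abstract does not state the thesis's exact aggregation formula, so this example assumes one common divide-and-conquer form: each block is fitted separately by Newton-Raphson, and the block estimates are combined with their information matrices as weights. The logistic model, the function names `fit_block` and `aggregate_estimate`, and the use of NumPy instead of Spark are all assumptions made for illustration.

```python
import numpy as np


def fit_block(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression on a single data block.

    Returns the block estimate and its information matrix X' W X.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted means
        w = mu * (1.0 - mu)                    # variance function of the logit model
        score = X.T @ (y - mu)                 # quasi-score on this block
        info = X.T @ (X * w[:, None])          # information matrix on this block
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information matrix at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    info = X.T @ (X * w[:, None])
    return beta, info


def aggregate_estimate(X, y, n_blocks):
    """Combine block estimates as (sum_k A_k)^{-1} sum_k A_k beta_k.

    Only one block is processed at a time, so only p x p summaries
    need to be kept between blocks (assumed aggregation rule).
    """
    p = X.shape[1]
    info_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
    for X_k, y_k in blocks:
        beta_k, info_k = fit_block(X_k, y_k)
        info_sum += info_k
        weighted_sum += info_k @ beta_k
    return np.linalg.solve(info_sum, weighted_sum)
```

Calling `aggregate_estimate(X, y, 1)` reduces to the ordinary full-data fit, which gives a convenient baseline for comparing against larger block counts. In a distributed setting the loop body would run on the workers, and only the per-block sums would be sent back to the driver.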
Keywords/Search Tags: massive data, generalized linear model, divide and conquer algorithm, aggregate estimation, parallel algorithm