Research On Algorithm Of Generalized Linear Model Under Massive Data

Posted on:2021-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:S D Chen

Full Text:PDF

GTID:2480306602476764

Subject:Mathematics

Abstract/Summary:

The generalized linear model is a direct generalization of the linear regression model. It allows the dependent variable to follow a distribution from the exponential family, so it applies to a wider range of scenarios than the linear regression model. To apply the model in practice, parameter estimation and variable selection have become two important research directions for generalized linear models. However, previous research has mainly focused on small samples. With the rapid development of data collection technology, massive data has become a research trend. Massive data offers opportunities for research, but it also poses challenges: compared with the small-sample case, the main difficulties are completing the computation at all with limited computing resources, and completing it efficiently. The divide and conquer algorithm is an effective way to address these difficulties and has gradually been applied to statistical analysis. However, research in this area is still at a preliminary stage, and parameter estimation and variable selection for complex models under massive data need further study.

The first part of this thesis studies parameter estimation for the generalized linear model under massive data. To avoid using the full set of observations at every step of the iteration, the thesis combines the Newton-Raphson algorithm with the divide and conquer algorithm and proposes an aggregate quasi-likelihood estimation algorithm that can estimate the parameters of the generalized linear model in both single-machine and distributed modes. Through divide and conquer, the aggregation algorithm lowers the hardware requirements. In terms of asymptotic properties, it is proved that when the relationship between the number of blocks and the total number of observations satisfies certain conditions, the aggregate estimator produced by the aggregate quasi-likelihood estimation algorithm is asymptotically equivalent to the estimator computed directly from the full data. Numerical simulations are carried out in both a stand-alone environment and a cluster environment. The experimental results show that the aggregate estimation algorithm reduces memory overhead and is suitable for a Spark distributed cluster environment.

The second part of this thesis studies variable selection for the generalized linear model under massive data. First, at ordinary data scale, the variable selection methods for the logistic model with the SCAD and MCP penalties are extended to generalized linear models with a general link function. Second, the divide and conquer algorithm is used to extend the variable selection method further so that it suits the massive data setting. The algorithm is evaluated by numerical simulation in both a single-machine environment and a cluster environment. The experimental results show that divide and conquer reduces memory overhead and improves computational efficiency, and that distributed parallel execution improves efficiency further. These results verify that the algorithm is feasible and effective in practical applications.
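As a rough illustration of how Newton-Raphson can be combined with block-wise processing, the following Python sketch fits a logistic regression (a generalized linear model with the canonical logit link) by summing per-block score vectors and information matrices at each iteration, so that only one block has to be held in memory at a time. It is a minimal sketch under stated assumptions, not the thesis's aggregate quasi-likelihood algorithm, whose details the abstract does not give. The function names `solve`, `block_summaries`, and `aggregated_newton_raphson` are hypothetical, and the choice of the logistic model is an assumption made for concreteness.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def block_summaries(X, y, beta):
    """Score vector and information matrix of one block (logistic model, logit link)."""
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        mu = 1.0 / (1.0 + math.exp(-eta))   # mean response
        w = mu * (1.0 - mu)                 # variance function at mu
        for j in range(p):
            score[j] += xi[j] * (yi - mu)
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return score, info


def aggregated_newton_raphson(blocks, p, max_iter=50, tol=1e-10):
    """Newton-Raphson in which each step uses summaries added up over blocks."""
    beta = [0.0] * p
    for _ in range(max_iter):
        total_score = [0.0] * p
        total_info = [[0.0] * p for _ in range(p)]
        for X, y in blocks:  # blocks could be read from disk or sent to workers
            s, info = block_summaries(X, y, beta)
            for j in range(p):
                total_score[j] += s[j]
                for k in range(p):
                    total_info[j][k] += info[j][k]
        step = solve(total_info, total_score)
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta
```

Here `blocks` is any iterable of `(X, y)` pairs, where `X` is a list of covariate rows and `y` a list of 0/1 responses. Because the score and information are sums over observations, this particular scheme reproduces the full-data Newton-Raphson iterates up to rounding error; schemes that instead combine estimates computed separately on each block are only asymptotically equivalent to the full-data estimator, which is the kind of result the abstract describes.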

Keywords/Search Tags:

massive data, generalized linear model, divide and conquer algorithm, aggregate estimation, parallel algorithm

Related items

1	Estimation Algorithm Of Quantile Regression Under Massive Data
2	Iterative Divide-and-conquer Method Of Estimating Index Coefficients In Single-index Model Under Massive Data
3	Research On Algorithm Of Mixed Effect Regression Model Under Massive Data
4	The Research Of Divide And Conquer Algorithms For Skew-symmetric Tridiagonal Eigenvalue Problems
5	Research On Divide And Conquer Algorithms For Complex Electromagnetic Problems
6	Distributed Empirical Likelihood Estimation In Massive Data
7	Adaptive Quantile Regressions For Massive Datasets
8	On-diagonal Matrix Moore-penrose Inverse Parallel Computing
9	A New Algorithm Solving Eigenvalue Problem Of Matrices
10	Some Discussions On Generalized Linear Models With Missing Data