The rapid emergence of large-scale datasets has brought unprecedented opportunities to modern statistics. At the same time, however, it has posed significant challenges to traditional statistical computation methods. For instance, when the size of the dataset exceeds the computer's memory, or when the dataset is distributed across different computing servers, traditional methods may be difficult to implement directly. It is therefore necessary to design computationally efficient and theoretically guaranteed algorithms that take into account both the characteristics of the dataset and the computing resources available to users. This has become an increasingly important issue in statistics in recent years. In this thesis, we consider three important approaches to large-scale statistical computation, namely subsampling methods, stochastic gradient descent type methods, and distributed computing methods. The main contents are outlined as follows.

Subsampling methods are very useful for large-scale statistical analysis, especially when the available computing resources are extremely limited. The first topic of this thesis focuses on the asymptotic properties of a subsampling-based bagging estimator for large-scale problems. Specifically, we investigate the asymptotic variance and bias of the bagging estimator for M-estimation problems and establish its asymptotic normality under appropriate conditions. The results show that the bagging estimator can achieve the optimal statistical efficiency, provided that the bagging subsample size and the number of subsamples are sufficiently large. Moreover, we derive a variance estimator to facilitate further statistical inference. All theoretical findings are verified by extensive simulation studies.

Sometimes not only is the sample size of the whole dataset very large, but the number of model parameters can also be very high. In such situations, the classical Newton-Raphson algorithm can be highly inefficient or even infeasible in practice. For such problems, minibatch-based stochastic gradient descent type algorithms are commonly used. However, we find that their theoretical properties are still underexplored. Our second research topic is therefore the numerical convergence and statistical properties of minibatch-based gradient descent with momentum (MGDM) under the linear regression model. First, we study the numerical convergence of one type of MGDM algorithm and provide insights into how the two tuning parameters (i.e., the learning rate and the momentum parameter) affect the numerical convergence rate. The results reveal the acceleration effect of the momentum term and suggest how to specify the tuning parameters for faster numerical convergence. In addition, we explore the relationship between the statistical properties of the resulting MGDM estimator and the tuning parameters. Based on these theoretical findings, we give the conditions under which the resulting estimator achieves the optimal statistical efficiency.
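As an illustration of the kind of algorithm studied in this topic, the following sketch implements minibatch gradient descent with a heavy-ball momentum term for the least-squares loss. The function name, the minibatch sampling scheme, and the default values of the learning rate and momentum parameter are illustrative assumptions and need not match the exact MGDM variant analyzed in the thesis.

```python
import numpy as np

def mgdm_linear_regression(X, y, batch_size=64, lr=0.01, momentum=0.9,
                           n_epochs=50, seed=0):
    """Minibatch gradient descent with heavy-ball momentum for least squares.

    Illustrative sketch only; the MGDM variant analyzed in the thesis may
    differ in how minibatches are drawn and how the momentum term is formed.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)      # current estimate
    velocity = np.zeros(p)   # accumulated momentum term

    for _ in range(n_epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the least-squares loss on the current minibatch.
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            # Heavy-ball update: the momentum parameter controls how much of
            # the previous step is retained; the learning rate scales the step.
            velocity = momentum * velocity - lr * grad
            theta = theta + velocity
    return theta
```

Setting the momentum parameter to zero reduces the update to plain minibatch gradient descent; the results summarized above concern how the learning rate and the momentum parameter jointly govern the numerical convergence rate and the statistical efficiency of the final iterate.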
In certain scenarios, the massive volume of the dataset may exceed the storage and computing power of a single machine, or the dataset may be naturally distributed across different servers. In such cases, a distributed computing system is required to analyze and process the data. Our third research topic is therefore distributed kernel smoothing and prediction based on a novel grid point approximation (GPA) method. We find that the existing one-shot (OS) type kernel smoothing method incurs high computation and communication costs in prediction problems. To address this issue, we propose a distributed estimator based on the grid point approximation method, namely the GPA estimator. Once trained, the GPA estimator requires no communication and has an extremely low computation cost during the prediction phase. Theoretically, we prove that the GPA estimator achieves the same statistical efficiency as the whole-sample estimator under mild conditions. Furthermore, we provide two distributed bandwidth selectors with theoretical guarantees, suitable for different scenarios.
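The abstract does not spell out the GPA construction, so the following sketch only illustrates the general grid-based idea it alludes to: each worker evaluates a local kernel estimator on a shared grid of points, the server averages the grid values once, and subsequent predictions interpolate the stored values without further communication. The univariate setting, the Gaussian kernel, simple averaging, and linear interpolation are all assumptions made for this illustration and are not claimed to be the thesis's GPA estimator.

```python
import numpy as np

def local_grid_estimate(x_local, y_local, grid, h):
    """Nadaraya-Watson estimate evaluated on a fixed grid (one worker).

    Gaussian kernel with bandwidth h; grid is assumed sorted in
    increasing order.
    """
    diffs = (grid[:, None] - x_local[None, :]) / h
    w = np.exp(-0.5 * diffs ** 2)
    # Small constant guards against empty neighborhoods on sparse workers.
    return (w @ y_local) / (w.sum(axis=1) + 1e-12)

def aggregate_grid_estimates(worker_estimates):
    """One-shot averaging of the per-worker grid values on the server."""
    return np.mean(worker_estimates, axis=0)

def predict(x_new, grid, grid_values):
    """Predict by interpolating the stored grid values; no further
    communication with the workers is needed at this stage."""
    return np.interp(x_new, grid, grid_values)
```

Under this scheme the per-worker grid values are communicated only once during training, and every subsequent prediction is a cheap lookup on the stored grid, which is consistent with the communication and computation savings described above.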