The rapid emergence of large-scale datasets has brought unprecedented opportunities to modern statistics. At the same time, however, it has posed significant challenges to traditional statistical computation methods. For instance, when the size of the dataset exceeds the computer's memory, or when the dataset is distributed across different computing servers, traditional methods may be difficult to implement directly. It is therefore necessary to design computationally efficient and theoretically guaranteed algorithms that take into account both the characteristics of the dataset and the computing resources available to users. This has become an increasingly important issue in statistics in recent years. In this thesis, we consider three important approaches to large-scale statistical computation, namely subsampling methods, stochastic gradient descent type methods, and distributed computing methods. The main contents are outlined as follows.

Subsampling methods are very useful for large-scale statistical analysis, especially when the available computing resources are extremely limited. The first topic of this thesis focuses on the asymptotic properties of a subsampling-based bagging estimator for large-scale problems. Specifically, we investigate the asymptotic variance and bias of the bagging estimator for M-estimation problems and establish its asymptotic normality under appropriate conditions. The results show that the bagging estimator can achieve the optimal statistical efficiency, provided that the bagging subsample size and the number of subsamples are sufficiently large. Moreover, we derive a variance estimator to facilitate further statistical inference. All theoretical findings are verified by extensive simulation studies.

Sometimes not only is the sample size of the whole dataset very large, but the number of model parameters can also be very high. In such situations, the classical Newton-Raphson algorithm can be highly inefficient or even infeasible in practice. For such problems, minibatch-based stochastic gradient descent type algorithms are commonly used. However, we find that their theoretical properties are still underexplored. Our second research topic is therefore the numerical convergence and statistical properties of minibatch-based gradient descent with momentum (MGDM) under the linear regression model. First, we study the numerical convergence of one type of MGDM algorithm and provide insights into how the two tuning parameters (i.e., the learning rate and the momentum parameter) affect the numerical convergence rate. The results reveal the acceleration effect of the momentum term and suggest how to specify the tuning parameters for faster numerical convergence. In addition, we explore the relationship between the statistical properties of the resulting MGDM estimator and the tuning parameters. Based on these theoretical findings, we give the conditions under which the resulting estimator achieves the optimal statistical efficiency.
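As an illustration of the kind of algorithm studied in this topic, the following sketch implements minibatch gradient descent with a heavy-ball momentum term for the least-squares loss. The function name, the minibatch sampling scheme, and the default values of the learning rate and momentum parameter are illustrative assumptions and need not match the exact MGDM variant analyzed in the thesis.

```python
import numpy as np

def mgdm_linear_regression(X, y, batch_size=64, lr=0.01, momentum=0.9,
                           n_epochs=50, seed=0):
    """Minibatch gradient descent with heavy-ball momentum for least squares.

    Illustrative sketch only; the MGDM variant analyzed in the thesis may
    differ in how minibatches are drawn and how the momentum term is formed.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)      # current estimate
    velocity = np.zeros(p)   # accumulated momentum term

    for _ in range(n_epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the least-squares loss on the current minibatch.
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            # Heavy-ball update: the momentum parameter controls how much of
            # the previous step is retained; the learning rate scales the step.
            velocity = momentum * velocity - lr * grad
            theta = theta + velocity
    return theta
```

Setting the momentum parameter to zero reduces the update to plain minibatch gradient descent; the results summarized above concern how the learning rate and the momentum parameter jointly govern the numerical convergence rate and the statistical efficiency of the final iterate.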
In certain scenarios, the massive volume of the dataset may exceed the storage and computing power of a single machine, or the dataset may be naturally distributed across different servers. In such cases, a distributed computing system is required to analyze and process the data. Our third research topic is therefore distributed kernel smoothing and prediction based on a novel grid point approximation (GPA) method. We find that the existing one-shot (OS) type kernel smoothing method incurs high computation and communication costs in prediction problems. To address this issue, we propose a distributed estimator based on the grid point approximation method, namely the GPA estimator. Once trained, the GPA estimator requires no communication and has an extremely low computation cost during the prediction phase. Theoretically, we prove that the GPA estimator achieves the same statistical efficiency as the whole-sample estimator under mild conditions. Furthermore, we provide two distributed bandwidth selectors with theoretical guarantees, suitable for different scenarios.
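The abstract does not spell out the GPA construction, so the following sketch only illustrates the general grid-based idea it alludes to: each worker evaluates a local kernel estimator on a shared grid of points, the server averages the grid values once, and subsequent predictions interpolate the stored values without further communication. The univariate setting, the Gaussian kernel, simple averaging, and linear interpolation are all assumptions made for this illustration and are not claimed to be the thesis's GPA estimator.

```python
import numpy as np

def local_grid_estimate(x_local, y_local, grid, h):
    """Nadaraya-Watson estimate evaluated on a fixed grid (one worker).

    Gaussian kernel with bandwidth h; grid is assumed sorted in
    increasing order.
    """
    diffs = (grid[:, None] - x_local[None, :]) / h
    w = np.exp(-0.5 * diffs ** 2)
    # Small constant guards against empty neighborhoods on sparse workers.
    return (w @ y_local) / (w.sum(axis=1) + 1e-12)

def aggregate_grid_estimates(worker_estimates):
    """One-shot averaging of the per-worker grid values on the server."""
    return np.mean(worker_estimates, axis=0)

def predict(x_new, grid, grid_values):
    """Predict by interpolating the stored grid values; no further
    communication with the workers is needed at this stage."""
    return np.interp(x_new, grid, grid_values)
```

Under this scheme the per-worker grid values are communicated only once during training, and every subsequent prediction is a cheap lookup on the stored grid, which is consistent with the communication and computation savings described above.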