Font Size: a A A

Convergence Analysis Of SGD And Ada Grad Algorithms In Nonconvex Settings

Posted on:2024-07-11Degree:MasterType:Thesis
Country:ChinaCandidate:T R XingFull Text:PDF
GTID:2568307112489674Subject:Computational Mathematics
Abstract/Summary:
Gradient-like optimization algorithm is a simple and efficient method for solving large-scale optimization problems,which has been widely used in machine learning.These algo-rithms minimize a loss function by iteratively adjusting model parameters so that the model can better fit the training data.Among them,stochastic gradient descent(SGD)is a very com-mon optimization algorithm in neural network model training.Its main idea is to use part of the training samples to calculate the gradient in each iteration,thereby reducing the calculation cost and speeding up the convergence speed.When analyzing the convergence of SGD,it is usually assumed that the stochastic gradient is an unbiased estimate of the batch gradient of the objective function,because using partial sample gradients instead of batch gradients will introduce noise,which may affect the performance of the algorithm.However,in large-scale optimization problems,this assumption is difficult to satisfy.In order to solve the above prob-lems,this paper analyzes the convergence of the non-convex function F under the assumption that the stochastic gradient is a biased estimate of the batch gradient of the objective function,and finally obtains a gradual convergence result.However,the algorithm needs to manually adjust the step size to achieve the best training effect,and may fall into a local optimal solution,especially in the later stage of training.The adaptive gradient descent method can adaptively adjust the step size according to his-torical information,reducing the cost,so it can converge to the optimal solution faster.Such algorithms are widely used in computer vision and pattern recognition tasks.However,so far,we still lack a theoretical analysis of the convergence of such algorithms when the objective function is a non-convex function.Therefore,this article investigates the Ada Grad algorithm based on batch gradients and the Ada Grad algorithm based on stochastic gradient for non-convex objective function,respectively.For the Ada Grad algorithm with batch gradient,we studied the global step size and coordinate-wise step size respectively,and provided asymptotic convergence results under the assumption that the objective function is L-smooth.For the Ada-Grad algorithm with stochastic gradient,we also studied the global step size and coordinate-wise step size separately.Under the assumption that the objective function is L-smooth,the stochastic gradient is an unbiased estimate of the batch gradient of the objective function,and the second order moment of the stochastic gradient is uniformly bounded,we obtain the rate of non-asymptotic convergence of O(lnT/T1/2).
Keywords/Search Tags:Nonconvex optimization, SGD, Ada Grad, Convergence, Machine learning
Related items