Nowadays, deep learning has driven a revolution in science and technology and enabled rapid, leapfrog development across many fields. As a core information technology, it has shown remarkable success in diverse domains including image recognition, speech recognition, and natural language processing. At the same time, deep learning has brought new techniques and perspectives to biology and medicine, with important achievements in areas such as drug discovery and genomics. Deep learning is becoming a mainstream research method of the next generation of information science and is gradually evolving into a general, foundational technology.

The performance of a neural network depends critically on its model structure and the corresponding learning algorithm. The distribution of connection strengths between network units encodes all the information the network holds and affects its convergence speed, so the learning algorithm, which governs weight correction and structure optimization, determines the performance of the network. The optimization of neural networks, especially their weights and learning algorithms, is therefore an important subject in the field of intelligent computing.

At present, most learning algorithms in deep learning are iterative: they seek a set of parameters that optimizes an auxiliary objective function with respect to the weights. Gradient-based optimization is of great practical importance in many fields of science and engineering, since many problems can be cast as optimizing a scalar parameterized objective function with respect to its parameters; stochastic gradient descent (SGD) is a representative example. These algorithms process a stochastic mini-batch of data at each iteration and update the parameters by taking small gradient steps. It can be difficult to choose an appropriate learning rate: a learning rate that is too small leads to painfully
slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge. The general idea of these algorithms is that the gradient computed on a subset approximates the true gradient on the whole dataset; using mini-batches instead of all the data introduces gradient noise and variance, but computing the gradient on all the data is compute-intensive and requires substantial gradient storage. We therefore focus on the learning rate and the gradient variance in order to improve accuracy and reduce loss.

Regarding the learning rate, many gradient-based optimization methods involve sensitive hyper-parameters: learning stability depends heavily on the choice of hyper-parameters, and more hyper-parameters lead to endless ways of configuring them. Regarding variance, many algorithms based on variance reduction techniques require the storage of all gradients or dual variables and are compute-intensive.

In view of the learning rates of gradient-based algorithms, much research in recent years has sought to address these shortcomings; the main modification is to let the learning rate vary to suit different parameters. On learning rates, our work builds on existing research in two aspects: (1) we present a novel adaptive mechanism and apply it to Adadelta and Adam; (2) we apply the resulting algorithms to LSTM and LeNet on text and image datasets, demonstrating their effectiveness in terms of train accuracy, test accuracy, train error, and test error. The experimental results show that, on both the text and image datasets, the algorithms achieve satisfactory classification and loss results. For the problem of gradient variance, we present a novel variance reduction technique termed SMVRG. We always need a small
learning rate due to the variance of SGD. In this paper, we propose a variance reduction technique that uses a moving average of gradients in place of the average gradient. At each stage we keep a version of the weights that is close to the optimal weights; for example, we can keep the current weights as the interim optimal weights after n SGD iterations. Moreover, because we use the moving average gradient as the average gradient, we only need to store the current gradient and the previous average gradient. The experimental results show that the proposed algorithm requires less gradient storage than existing algorithms and achieves good results with a relatively large learning rate, while attaining better classification accuracy and loss values on both the train set and the test set.
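The mini-batch update described above can be sketched on a toy least-squares problem as follows; the data, batch size, and learning rate are illustrative choices for exposition, not the settings used in our experiments.

```python
import numpy as np

# Illustrative mini-batch SGD on least squares; the data, batch size,
# and learning rate below are toy choices, not the paper's settings.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

def sgd(X, y, lr=0.1, batch=32, epochs=50):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                # reshuffle each epoch
        for s in range(0, n, batch):
            b = idx[s:s + batch]
            # the mini-batch gradient approximates the full-dataset gradient,
            # at the cost of gradient noise and variance
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                      # small step along -gradient
    return w

w = sgd(X, y)
```

With a noisy gradient estimate like this, the learning rate trade-off in the text is visible directly: shrinking `lr` slows convergence, while enlarging it makes the iterates fluctuate around the minimum.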
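For background, the stock Adam rule that our adaptive mechanism builds on keeps exponential moving averages of the gradient and its square, giving each parameter its own effective step size. The sketch below shows standard, unmodified Adam on a toy quadratic; it is not our proposed variant.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)    # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: lr rescaled by the running gradient magnitude
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy check: minimize f(w) = ||w||^2, whose gradient is 2w
w_adam = np.array([5.0, -3.0])
m = np.zeros_like(w_adam)
v = np.zeros_like(w_adam)
for t in range(1, 2001):
    w_adam, m, v = adam_step(w_adam, 2.0 * w_adam, m, v, t)
```

Because the denominator adapts per coordinate, the update is less sensitive to the raw gradient scale, which is the property adaptive mechanisms such as ours aim to exploit further.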
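The storage-saving idea behind the moving-average scheme can be illustrated with an SVRG-style sketch in which the full average gradient is replaced by a running moving average, so only a snapshot weight, the current gradient, and the previous average need to be kept. This is a schematic reading of the technique on a toy problem, with names and constants of our own choosing, not the exact SMVRG algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([0.5, -1.0])
y = X @ true_w                       # noiseless toy data, exact optimum true_w

def grad_i(w, i):
    # per-example least-squares gradient
    return 2.0 * X[i] * (X[i] @ w - y[i])

def smvrg_sketch(lr=0.02, inner=100, epochs=20, beta=0.9):
    w = np.zeros(2)
    for _ in range(epochs):
        w_snap = w.copy()            # snapshot weights kept after n SGD steps
        mu = np.zeros(2)             # moving average standing in for the mean gradient
        for _ in range(inner):
            i = rng.integers(len(y))
            g = grad_i(w, i)
            mu = beta * mu + (1 - beta) * g
            # variance-reduced direction: only g and mu are stored here,
            # not a per-example gradient table as in earlier methods
            w -= lr * (g - grad_i(w_snap, i) + mu)
    return w

w_vr = smvrg_sketch()
```

The storage argument from the text is visible in the inner loop: the update touches only the current gradient, the snapshot, and the running average, regardless of the dataset size.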