
On The Depth And Big Model Of Deep Neural Networks: Theory And Algorithm

Posted on: 2019-06-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Z Sun    Full Text: PDF
GTID: 1368330599965129    Subject: Computer Science and Technology
Abstract/Summary:
In recent years, deep neural networks (DNNs) have achieved great success in many applications. In fact, DNNs did not become widely used until 2006, even though many of the underlying techniques had been proposed in the 1990s. Essentially, two driving forces underlie the success of DNNs after 2006: increasing depth and growing model size. To successfully increase depth, many techniques have been proposed, e.g., auto-encoders, batch normalization, and residual networks. Simultaneously, to handle the growing model size efficiently, parallel training frameworks such as data parallelism and model parallelism have been developed. However, these techniques are far from sufficient for moving toward better deep learning. First, although many techniques increase depth, an important question is how to understand the advantages and disadvantages of depth from a theoretical point of view. Second, most parallel algorithms are inherited directly from convex optimization, while DNNs are highly non-convex models; a natural question is therefore how to handle the non-convexity of DNNs during parallel training. Third, another difference between DNNs and traditional shallow models is that DNNs contain many redundant parameters, which causes extremely high communication cost during parallel training; how to handle this redundancy is another challenge.

To tackle these challenges, this thesis makes the following investigations. First, we propose uniform upper bounds for the representation ability and the model capacity of DNNs. Based on these bounds, we analyze the pros and cons of depth, and, guided by this analysis, we propose to improve the performance of DNNs by maximizing the margin. Second, we prove that model averaging, the model aggregation method commonly used in data parallelism, cannot provide a performance guarantee for the global model. We therefore propose to use an ensemble as the aggregation method, and we design a new parallel training framework based on the ensemble. Third, we propose to regard communication-efficient distributed deep learning as a multi-agent system, giving concrete definitions for the actions, environments, and utility. Based on this multi-agent formulation, we propose a best-response strategy to reduce the communication cost, i.e., transferring only the non-redundant parameters (or gradients) during communication.
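To make the margin-maximization idea concrete, here is a minimal sketch, in PyTorch, of training a network with a hinge-style multi-class margin loss that penalizes small gaps between the true-class logit and the others. The network architecture and hyperparameters are placeholders, and this loss is a standard stand-in rather than the thesis's actual objective.

import torch
import torch.nn as nn

# Placeholder network; the thesis does not fix a specific architecture here.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Hinge-style multi-class margin loss: encourages the true-class logit
# to exceed every other logit by at least `margin`.
margin_loss = nn.MultiMarginLoss(margin=1.0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(x, y):
    optimizer.zero_grad()
    loss = margin_loss(model(x), y)  # small margins incur hinge penalty
    loss.backward()
    optimizer.step()
    return loss.item()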
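The contrast between model averaging and ensemble aggregation can also be illustrated with a short sketch. The function names are hypothetical and this is not the thesis's parallel framework; it only shows the two aggregation rules side by side.

import copy
import torch

def average_models(models):
    # Parameter averaging: for a non-convex DNN, the averaged weights
    # carry no guarantee of matching the local models' performance.
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for p_avg, *ps in zip(avg.parameters(),
                              *(m.parameters() for m in models)):
            p_avg.copy_(torch.stack(ps).mean(dim=0))
    return avg

def ensemble_predict(models, x):
    # Ensemble aggregation: average the local models' predictions
    # instead of their weights.
    with torch.no_grad():
        return torch.stack([m(x).softmax(dim=-1)
                            for m in models]).mean(dim=0)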
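One simple way to read "transferring only the non-redundant parameters (or gradients)" is top-k gradient sparsification, sketched below. This is an assumed illustration of the general idea, not the best-response protocol derived in the thesis: each worker communicates only the largest-magnitude gradient entries.

import torch

def compress_gradient(grad, ratio=0.01):
    # Keep only the largest-magnitude entries; only the pair
    # (indices, values) needs to be communicated.
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, indices = flat.abs().topk(k)
    return indices, flat[indices]

def decompress_gradient(indices, values, shape):
    # Reconstruct a dense gradient with zeros in the dropped positions.
    flat = torch.zeros(shape).flatten()
    flat[indices] = values
    return flat.view(shape)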
Keywords/Search Tags: deep learning, generalization, distributed machine learning, data parallelism