
Research and Implementation of a Distributed Deep Learning System Based on Spark

Posted on: 2020-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2428330596976538
Subject: Engineering
Abstract/Summary:
In recent years, with the advent of big data and the rapid development of artificial intelligence, especially deep learning, deep neural network models have achieved breakthrough improvements and have been applied in many fields, including speech recognition, image recognition, and natural language processing. Deep learning is a computationally intensive task that requires a large amount of computation. Because a model must improve its performance by iteratively updating its parameters, training these neural networks is very time-consuming. Although computer hardware, network model structures, and training methods have all progressed in recent years, training time on a single machine remains too long. Furthermore, deep learning requires large models and large amounts of training data. Research shows that dataset size has a linear relationship with the performance of a deep learning model, and datasets are expected to reach the PB and even ZB scale in the future. As datasets and model parameters grow larger, a single machine's memory (or GPU memory) can no longer meet the requirements of deep learning tasks. Distributed systems have good flexibility and scalability and can combine single-machine computation resources effectively, so distributed deep learning has become an effective way to solve this problem. However, existing deep learning frameworks lack cluster resource management and offer complicated interfaces for distributed training.

Against this background, this thesis first proposes Dpplee3, a distributed deep learning method and system based on Spark and PyTorch that adopts a data-parallel strategy. The system uses Spark for distributed cluster resource management and for distributed tasks such as data and model distribution. After analyzing suitable upper-layer deep learning training frameworks, the system adopts PyTorch to provide numerical computation, deep learning model construction, model training, and related functions. The system also defines and encapsulates its interfaces, giving users convenient and rapid access to distributed training. In short, the thesis effectively combines the advantages of PyTorch and Spark into a distributed deep learning system.

Secondly, to further improve the flexibility, efficiency, and availability of the system, this thesis studies several algorithms for updating neural network model parameters in distributed settings, and implements two of them, ASGD and Hogwild!, in the system. These two asynchronous methods are adapted to the characteristics of Spark distributed clusters. The thesis then studies distributed deep learning algorithms and parameter-update mechanisms further and proposes a multi-granular asynchronous parameter-update method, which lets users control the interaction granularity between worker nodes and the parameter server node, reducing bandwidth consumption and improving the speed of distributed training. In addition, the system separates the worker-node optimizer from the parameter-server optimizer, so users can flexibly choose optimizers for different training tasks.

Finally, experiments are conducted on the proposed system and methods. The results validate the effectiveness of the proposed methods and show that the system can use distributed clusters to improve training efficiency while remaining easy to use.
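To make the described architecture concrete, below is a minimal, illustrative sketch of the Spark-plus-PyTorch data-parallel pattern. It is not Dpplee3's actual code: all identifiers (make_model, train_partition, SYNC_EVERY) are hypothetical, the driver stands in for the parameter server, and, because Spark's mapPartitions is bulk-synchronous, the sketch uses synchronous averaging rounds rather than the asynchronous ASGD/Hogwild! updates the thesis implements. The SYNC_EVERY constant only mimics the multi-granular idea of controlling how many local steps a worker takes between parameter exchanges with the server.

```python
# Illustrative sketch only: data-parallel training rounds on Spark with a
# PyTorch model. All identifiers here are hypothetical, not Dpplee3's API.
import numpy as np
import torch
import torch.nn as nn
from pyspark.sql import SparkSession

SYNC_EVERY = 5   # local SGD steps between parameter exchanges (granularity)
LOCAL_LR = 0.05  # learning rate used by the worker-side optimizer

def make_model():
    # The same toy architecture is built on the driver and on every worker.
    return nn.Linear(10, 1)

def train_partition(weights, rows):
    # Runs on a Spark worker: load the broadcast weights, take SYNC_EVERY
    # local SGD steps on this data partition, return the updated weights.
    rows = list(rows)
    if not rows:
        return iter([])
    model = make_model()
    model.load_state_dict({k: torch.tensor(v) for k, v in weights.items()})
    opt = torch.optim.SGD(model.parameters(), lr=LOCAL_LR)
    loss_fn = nn.MSELoss()
    x = torch.tensor(np.stack([r[0] for r in rows]))
    y = torch.tensor(np.stack([r[1] for r in rows]))
    for _ in range(SYNC_EVERY):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return iter([{k: v.detach().numpy() for k, v in model.state_dict().items()}])

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[4]").appName("sketch").getOrCreate()
    sc = spark.sparkContext

    # Synthetic regression data, partitioned across the (local) workers.
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=10).astype(np.float32)
    data = [(x, np.array([x @ w_true], dtype=np.float32))
            for x in rng.normal(size=(2000, 10)).astype(np.float32)]
    rdd = sc.parallelize(data, numSlices=4)

    # The driver plays the parameter server: broadcast parameters, let each
    # worker train on its partition, then average the returned weights.
    server = make_model()
    for _ in range(10):
        weights = {k: v.detach().numpy() for k, v in server.state_dict().items()}
        bcast = sc.broadcast(weights)
        results = rdd.mapPartitions(
            lambda rows: train_partition(bcast.value, rows)).collect()
        avg = {k: np.mean([r[k] for r in results], axis=0) for k in weights}
        server.load_state_dict({k: torch.tensor(v) for k, v in avg.items()})
    spark.stop()
```

In Dpplee3 itself, per the abstract, workers exchange updates with the parameter server asynchronously and each side can run its own optimizer, which a synchronous sketch cannot show; the basic data flow, however, is the same: broadcast parameters out, train on partitions, and aggregate updates back.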
Keywords/Search Tags: Distributed deep learning, PyTorch, Spark, Data parallelization