
Research and Implementation of a Distributed Deep Learning System Based on Spark

Posted on: 2020-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2428330596976538
Subject: Engineering
Abstract/Summary:
In recent years, with the advent of big data and the rapid development of artificial intelligence, especially deep learning, deep neural network models have achieved breakthrough improvements and have been applied in many fields, including speech recognition, image recognition, and natural language processing. Deep learning is a computationally intensive task that requires a large amount of computation. Because a model must improve its performance by iteratively updating its parameters, training these neural networks is very time-consuming. Although computer hardware, network model structures, and training methods have all progressed in recent years, training time on a single machine remains too long. Furthermore, deep learning requires large models and large amounts of training data. Research shows that dataset size has a linear relationship with the performance of a deep learning model, and datasets are expected to reach the PB and even ZB scale in the future. As datasets and model parameters grow larger, a single machine's memory (or GPU memory) can no longer meet the requirements of deep learning tasks. Distributed systems have good flexibility and scalability and can combine single-machine computation resources effectively, so distributed deep learning has become an effective way to solve this problem. However, existing deep learning frameworks lack cluster resource management and offer complicated interfaces for distributed training.

Against this background, this thesis first proposes Dpplee3, a distributed deep learning method and system based on Spark and PyTorch that adopts a data-parallel strategy. The system uses Spark for distributed cluster resource management and for distributed tasks such as data and model distribution. After analyzing suitable upper-layer deep learning training frameworks, the system adopts PyTorch to provide numerical computation, deep learning model construction, model training, and related functions. The system also defines and encapsulates its interfaces, giving users convenient and rapid access to distributed training. In short, the thesis effectively combines the advantages of PyTorch and Spark into a distributed deep learning system.

Secondly, to further improve the flexibility, efficiency, and availability of the system, this thesis studies several algorithms for updating neural network model parameters in distributed settings, and implements two of them, ASGD and Hogwild!, in the system. These two asynchronous methods are adapted to the characteristics of Spark distributed clusters. The thesis then studies distributed deep learning algorithms and parameter-update mechanisms further and proposes a multi-granular asynchronous parameter-update method, which lets users control the interaction granularity between worker nodes and the parameter server node, reducing bandwidth consumption and improving the speed of distributed training. In addition, the system separates the worker-node optimizer from the parameter-server optimizer, so users can flexibly choose optimizers for different training tasks.

Finally, experiments are conducted on the proposed system and methods. The results validate the effectiveness of the proposed methods and show that the system can use distributed clusters to improve training efficiency while remaining easy to use.
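To make the described architecture concrete, below is a minimal, illustrative sketch of the Spark-plus-PyTorch data-parallel pattern. It is not Dpplee3's actual code: all identifiers (make_model, train_partition, SYNC_EVERY) are hypothetical, the driver stands in for the parameter server, and, because Spark's mapPartitions is bulk-synchronous, the sketch uses synchronous averaging rounds rather than the asynchronous ASGD/Hogwild! updates the thesis implements. The SYNC_EVERY constant only mimics the multi-granular idea of controlling how many local steps a worker takes between parameter exchanges with the server.

```python
# Illustrative sketch only: data-parallel training rounds on Spark with a
# PyTorch model. All identifiers here are hypothetical, not Dpplee3's API.
import numpy as np
import torch
import torch.nn as nn
from pyspark.sql import SparkSession

SYNC_EVERY = 5   # local SGD steps between parameter exchanges (granularity)
LOCAL_LR = 0.05  # learning rate used by the worker-side optimizer

def make_model():
    # The same toy architecture is built on the driver and on every worker.
    return nn.Linear(10, 1)

def train_partition(weights, rows):
    # Runs on a Spark worker: load the broadcast weights, take SYNC_EVERY
    # local SGD steps on this data partition, return the updated weights.
    rows = list(rows)
    if not rows:
        return iter([])
    model = make_model()
    model.load_state_dict({k: torch.tensor(v) for k, v in weights.items()})
    opt = torch.optim.SGD(model.parameters(), lr=LOCAL_LR)
    loss_fn = nn.MSELoss()
    x = torch.tensor(np.stack([r[0] for r in rows]))
    y = torch.tensor(np.stack([r[1] for r in rows]))
    for _ in range(SYNC_EVERY):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return iter([{k: v.detach().numpy() for k, v in model.state_dict().items()}])

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[4]").appName("sketch").getOrCreate()
    sc = spark.sparkContext

    # Synthetic regression data, partitioned across the (local) workers.
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=10).astype(np.float32)
    data = [(x, np.array([x @ w_true], dtype=np.float32))
            for x in rng.normal(size=(2000, 10)).astype(np.float32)]
    rdd = sc.parallelize(data, numSlices=4)

    # The driver plays the parameter server: broadcast parameters, let each
    # worker train on its partition, then average the returned weights.
    server = make_model()
    for _ in range(10):
        weights = {k: v.detach().numpy() for k, v in server.state_dict().items()}
        bcast = sc.broadcast(weights)
        results = rdd.mapPartitions(
            lambda rows: train_partition(bcast.value, rows)).collect()
        avg = {k: np.mean([r[k] for r in results], axis=0) for k in weights}
        server.load_state_dict({k: torch.tensor(v) for k, v in avg.items()})
    spark.stop()
```

In Dpplee3 itself, per the abstract, workers exchange updates with the parameter server asynchronously and each side can run its own optimizer, which a synchronous sketch cannot show; the basic data flow, however, is the same: broadcast parameters out, train on partitions, and aggregate updates back.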
Keywords/Search Tags: Distributed deep learning, PyTorch, Spark, Data parallelization