A Distributed Operating Platform For Deep Learning

Posted on:2022-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:X Zhang

Full Text:PDF

GTID:2518306605466944

Subject:Master of Engineering

Abstract/Summary:


In recent years, as artificial intelligence technology has been deployed more widely, the market's demand for deep learning has grown. Deep learning has entered a new stage of development, especially in intelligent recommendation and autonomous driving. It has become part of daily life and is gradually changing people's lifestyles. Deep learning training generally involves a large amount of computation. Thanks to continuous improvements in GPU performance, computation-heavy training has become increasingly common. However, as data sets grow and models become more complex, ordinary training methods can no longer meet the rapidly increasing computing needs of deep learning. How to provide a more efficient, lower-cost training method that gives users more convenient, large-scale, high-performance deep learning services has become a new research direction.

This thesis analyzes the problems in deep learning training through in-depth investigation of deep learning environment deployment, the task execution process, and deep learning frameworks. It designs and implements an online operating platform that executes deep learning tasks at scale in a distributed manner. The architecture and development of the platform are guided by three goals: meeting the diverse needs of users, improving the utilization of computing resources, and achieving efficient management and control of the platform. The main work of the thesis is as follows:

(1) Based on Spring Boot, a server-side software framework for the distributed deep learning operating platform is designed and implemented, covering user requirements analysis, overall system architecture design, and module development by function. Given the heavy computation of deep learning tasks and the actual needs of users, the system is logically split into user, task, and file modules for design and development. According to user needs, deep learning tasks are divided into three types: model development, training and prediction, and distributed training. A separate execution process is defined and developed for each task type.

(2) To address the complex deployment and maintenance of the deep learning runtime environment, and based on an in-depth investigation of the environment construction process and the principles of containerization, Docker images containing the deep learning runtime environment were built. Images were built separately for three deep learning frameworks: TensorFlow, PyTorch, and MXNet.

(3) To address the complex execution process and heavy computation of deep learning tasks, an executor plug-in for executing deep learning tasks was designed and implemented, and the execution flow of deep learning tasks was defined on the container side. The flow covers task parameter reception, task execution, and the upload of execution results and logs.

(4) To address the uneven distribution of computing resources and the difficulty of managing them, the resource management scheme of the Kubernetes cluster was investigated, and GPU computing resources are registered in the cluster for unified allocation and invocation, achieving efficient management and utilization of computing resources.

(5) To address the high storage requirements of deep learning tasks, various distributed storage solutions were investigated in depth, and object storage is used to store files. Deploying the Ceph file system on the cluster meets users' needs for storing large files and improves the scalability of the file system.

Finally, based on this design and implementation, the platform is deployed on a single machine and on a cluster. Its server-side and client-side functions are tested, and the JMeter test tool is used to simulate a production environment and test the platform's performance. The test results show that the platform provides a complete and convenient deep learning job service, meets the development needs of most users, and performs well in system response time. The platform also shows good scalability and high throughput.
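
The container-side execution flow described in item (3) (task parameter reception, task execution, and upload of execution results and logs) might look roughly like the following Python sketch. It is illustrative only and is not taken from the thesis: the JSON fields `task_id` and `command`, the `TaskResult` type, and the `upload` callback are all hypothetical names, and a real executor would presumably upload to the platform's file storage rather than call a local function.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskResult:
    """Outcome of one task run (hypothetical structure)."""
    task_id: str
    exit_code: int
    log: str


def receive_params(raw: str) -> Dict[str, object]:
    """Step 1: parse task parameters from a JSON string (assumed format)."""
    params = json.loads(raw)
    if "task_id" not in params or "command" not in params:
        raise ValueError("task parameters need 'task_id' and 'command'")
    return params


def execute(raw: str, upload: Callable[[str, Dict[str, str]], None]) -> TaskResult:
    """Run a task end to end: receive parameters, execute, upload result and log."""
    params = receive_params(raw)
    task_id = str(params["task_id"])
    command: List[str] = list(params["command"])

    # Step 2: run the task command; stderr is merged into stdout to form the log.
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    # Step 3: hand the status and captured log to the upload callback.
    status = "succeeded" if proc.returncode == 0 else "failed"
    upload(task_id, {"status": status, "log": proc.stdout})
    return TaskResult(task_id, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Demo: the "task" is a trivial Python one-liner; uploads go to a dict.
    store: Dict[str, Dict[str, str]] = {}
    raw_params = json.dumps(
        {"task_id": "demo", "command": [sys.executable, "-c", "print('training done')"]}
    )
    result = execute(raw_params, lambda tid, payload: store.update({tid: payload}))
    print(result.exit_code, store["demo"]["status"])
```

Passing the upload step in as a callback keeps the sketch independent of any particular storage backend, so the same flow could be pointed at object storage or at a test double.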

Keywords/Search Tags:

deep learning, task, Kubernetes, images, platform


Related items

1	Design And Implementation Of Task Management System For Deep Learning Based On Kubernetes
2	Design And Implementation Of Deep Learning Container Cloud Platform Based On Docker And Kubernetes
3	Research And Application Of GPU Scheduling Strategy And Task Parallelization Method On Deep Learning Cloud Platform
4	Research On Deep Learning Task Scheduling Based On Small Scale GPU Cluster Platform
5	Research And Application Of Task Scheduling Method In Heterogeneous Video Surveillance Cloud Computing Platform
6	The Design And Implementation Of The GPU Resource Management Component In Transwarp Container Platform
7	The Internal Integration Design And Implementation Of PaaS Platform System Based On Kubernetes
8	Research On Deep Learning-Based Representation Learning Algorithms
9	Design And Implementation Of Facial Feature Point Positioning Method Based On Deep Learning
10	Design And Implementation Of Cloud-Native Application Platform Based On Kubernetes