A Distributed Operating Platform For Deep Learning

Posted on: 2022-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
GTID: 2518306605466944
Subject: Master of Engineering
Abstract/Summary:
In recent years, as artificial intelligence technology has been put into practice, market demand for deep learning has broadened. Deep learning has entered a new stage of development, especially in intelligent recommendation and autonomous driving; it has become part of daily life and is gradually changing people's lifestyles. Deep learning training generally involves a large amount of computation. Thanks to the continuous improvement of GPU performance, such heavy computation has become increasingly commonplace. However, as data sets grow and models become deeper and more complex, ordinary training methods can no longer meet the rapidly increasing computing needs of deep learning. How to provide a more efficient, lower-cost training method that gives users convenient, large-scale, high-performance deep learning services has become a new research direction.

This thesis analyzes the problems that arise in deep learning training through an in-depth investigation of deep learning environment deployment, the task execution process, and deep learning frameworks. It designs and implements an online operating platform that executes deep learning tasks at scale in a distributed manner. The architecture and development of the platform are guided by three goals: meeting the diverse needs of users, improving the utilization of computing resources, and achieving efficient management and control of the platform. The main work of the thesis is as follows:

(1) Based on Spring Boot, a server-side software framework for the distributed deep learning operating platform is designed and implemented, covering user requirements analysis, the overall system architecture, and module-by-module development. Given the heavy computation of deep learning tasks and the actual needs of users, the system is logically split into user, task, and file modules. According to user needs, deep learning tasks are divided into three types: model development, training and prediction, and distributed training. A separate execution process is defined and developed for each task type.
(2) To address the complex deployment and maintenance of deep learning runtime environments, Docker images containing the deep learning runtime environment are built, based on an in-depth investigation of the environment construction process and the principles of containerization. Images are built separately for three deep learning frameworks: TensorFlow, PyTorch, and MXNet.
(3) To address the complex execution process and heavy computation of deep learning tasks, an executor plug-in for running deep learning tasks is designed and implemented, and the task execution flow is defined on the container side. The plug-in handles task parameter reception, task execution, and upload of execution results and logs.
(4) To address the uneven distribution of computing resources and the difficulty of managing them, the resource management scheme of Kubernetes clusters is investigated, and GPU computing resources are registered in the cluster for unified allocation and invocation, achieving efficient management and utilization of computing resources.
(5) To address the high storage requirements of deep learning tasks, various distributed storage solutions are investigated in depth and object storage is used to store files. Deploying the Ceph file system on the cluster meets users' needs for storing large files and improves the scalability of the file system.

Finally, the platform is deployed on both a single machine and a cluster. Its server-side and client-side functions are tested, and the JMeter testing tool is used to simulate a production environment and test the platform's performance. The test results show that the platform provides a complete and convenient deep learning job service, meets the development needs of most users, and performs well in system response time. The platform also offers better scalability and higher throughput.
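
The executor flow described in item (3) of the abstract (receive task parameters, execute the task, upload the result and log) might be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the thesis's implementation: the function `run_task`, the JSON fields `task_id` and `command`, the object keys, and the `upload` callback are all hypothetical names introduced here, and the abstract does not say in which language the executor plug-in is written.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskResult:
    task_id: str
    exit_code: int
    log: str


def run_task(raw_params: str, upload: Callable[[str, bytes], None]) -> TaskResult:
    """Run one task inside the container and upload its result and log.

    `upload(key, data)` is a hypothetical callback standing in for the
    platform's storage client (for example, an object-storage writer).
    """
    # Step 1: receive the task parameters (assumed here to be a JSON string).
    params = json.loads(raw_params)
    task_id: str = params["task_id"]
    command: List[str] = params["command"]

    # Step 2: execute the task, merging stderr into stdout to form the log.
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    # Step 3: upload the execution result and the log under assumed keys.
    summary = {"task_id": task_id, "exit_code": proc.returncode}
    upload(f"{task_id}/result.json", json.dumps(summary).encode("utf-8"))
    upload(f"{task_id}/log.txt", proc.stdout.encode("utf-8"))

    return TaskResult(task_id, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Usage: an in-memory dict stands in for the storage backend.
    store = {}
    demo = json.dumps(
        {"task_id": "demo", "command": [sys.executable, "-c", "print('training done')"]}
    )
    outcome = run_task(demo, store.__setitem__)
    print(outcome.exit_code, sorted(store))
```

Passing the uploader in as a callback keeps the sketch independent of any particular storage client; a real executor would presumably write to the platform's object storage instead of a dictionary.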
Keywords/Search Tags: deep learning, task, Kubernetes, images, platform