Design And Implementation Of Task Scheduling Subsystem In Distributed Deep Learning Inference System

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Hu
Full Text: PDF
GTID: 2428330575966299
Subject: Computer application technology
Abstract/Summary:
As a branch of machine learning, deep learning has attracted great attention from industry and academia in recent years and has made remarkable progress. It has been widely used in computer vision, speech recognition, and other fields. Deep learning uses deep neural network models to classify and recognize data, and its workflow can be divided into two phases: training and inference. Applying deep learning in production environments is mainly concerned with the inference phase. IDC, a world-renowned market research company, estimates that 80% of the computing power in future large-scale AI applications will be concentrated on inference. A large number of related studies are therefore devoted to improving the computational efficiency of the inference phase, such as NVIDIA's GPU devices, the deep learning ASICs developed by Cambricon in China, and FPGA-based deep learning accelerators. These works have greatly improved deep learning inference performance on a single node.

However, the performance of a single node is still insufficient when processing massive data, for example when a website such as YouTube needs to review the massive volume of video uploaded by its users. Distributed systems have always been an important way to provide computational capacity, so it is necessary and urgent to build a distributed deep learning inference system. Task scheduling is a very important component of a distributed computing system. With the rapid emergence of deep learning accelerators, the software and hardware environments of distributed deep learning inference systems are complex and change readily, so a task scheduling system must flexibly support new hardware and flexibly adjust its scheduling strategies. This dissertation therefore focuses on the design and implementation of the task scheduling subsystem in a distributed inference system.

Based on research and analysis of the task scheduling problem in distributed systems, a task scheduling subsystem with high reliability, efficiency, and flexibility is designed and implemented. It comprises a task management mechanism and a system information management mechanism. Task management mainly performs task scheduling. The basic function of system information management is to collect and process various kinds of system information and to provide a basis and guidance for task management. The work of the two parts is as follows:

1. A task management mechanism based on the master-slave model is designed and implemented. It provides the basic capabilities of task management, including task division and distribution, task monitoring, task migration, and result collection. A fault-tolerance mechanism is also designed and implemented to ensure the robustness of the system. To avoid a single point of failure (SPOF) at the master node, a hot-standby fault-tolerance mechanism is designed for the master node. To ensure that a job still executes correctly when a worker node fails, a fault-tolerance mechanism is designed for the worker nodes.

2. A system information management mechanism is designed and implemented in the task scheduling subsystem. To cope with the dynamic software and hardware environment of distributed systems and the diverse characteristics of deep learning applications, the mechanism supports customization of system information collection and processing strategies and provides a universal data access interface. This enables the task scheduling subsystem to dynamically adjust scheduling strategies according to the software and hardware environment and the application characteristics.

Finally, the robustness, performance, and scalability of the task scheduling subsystem are validated on the distributed deep learning inference system that was built. The performance of classifying ImageNet with several classical neural network models, such as AlexNet and GoogLeNet, is tested. The experimental results show that, at a scale of 100 nodes, the system achieves a speedup of 37.7x to 90.6x over single-node computing performance. The scalability of the system is also tested: the speedup grows roughly linearly as the system expands from 20 to 100 nodes. In addition, it is verified that the system continues to work normally and has good robustness under unexpected circumstances such as node outages.

In this dissertation, we designed and implemented a task scheduling subsystem for a distributed deep learning inference system. The subsystem can adjust its task scheduling strategy according to changes in the hardware and software environment, in order to cope with deep learning hardware and applications that come in many types and iterate rapidly.
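
The task management capabilities named in the abstract (task division, distribution, migration after a worker failure, and result collection) can be illustrated with a small single-process sketch. This is not the dissertation's implementation: the names `split_job`, `run_job`, `healthy_worker`, and `failing_worker` are hypothetical, workers are plain Python callables rather than remote nodes, round-robin distribution is an assumed strategy, and a raised `RuntimeError` stands in for a node outage.

```python
from collections import deque


def split_job(items, chunk_size):
    """Task division: cut a job (a list of inputs) into fixed-size tasks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_job(items, workers, chunk_size=2):
    """Toy master loop (hypothetical): round-robin distribution with migration.

    `workers` maps a worker name to a callable that processes one task.
    A callable may raise RuntimeError to simulate a node outage.
    Returns the collected results in input order and the surviving workers.
    """
    pending = deque(enumerate(split_job(items, chunk_size)))
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no live workers left")
        task_id, task = pending.popleft()
        name = alive[turn % len(alive)]
        try:
            results[task_id] = workers[name](task)
            turn += 1
        except RuntimeError:
            # Worker failure: drop the node and requeue the task so that
            # it migrates to one of the remaining workers.
            alive.remove(name)
            pending.append((task_id, task))
    # Result collection: reassemble per-task outputs in task order.
    collected = [r for task_id in sorted(results) for r in results[task_id]]
    return collected, alive


def healthy_worker(task):
    # Stand-in for running model inference on one batch of inputs.
    return [x * x for x in task]


def failing_worker(task):
    raise RuntimeError("simulated node outage")


outputs, alive = run_job(
    list(range(6)), {"w1": healthy_worker, "w2": failing_worker}
)
print(outputs)  # [0, 1, 4, 9, 16, 25]
print(alive)    # ['w1']
```

In this sketch the task first sent to `w2` is requeued and completed by `w1`, so the job still finishes correctly after the failure. A real system would also need heartbeat-based monitoring and a standby master, which are outside this example.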
Keywords/Search Tags:Deep Learning Inferencing, Task Scheduling, System Information Management, System Extensibility