With the rise of artificial intelligence, many deep learning jobs run in data centers. Container technology reduces the difficulty of deploying multiple deep learning jobs for data center administrators, so an increasing number of deep learning jobs are submitted to data centers in containerized form. The training process of a deep learning job relies on a distributed machine learning framework. Because the framework's components exchange data frequently, the network communication rate can easily become the performance bottleneck of training. However, the scheduling policies of existing container orchestration systems do not consider the performance gap between cross-node network transmission and intra-node communication among a framework's components. These policies therefore tend to place a framework's containerized components on different cluster nodes, which lengthens data transmission among the components and slows down training. Meanwhile, a container's virtualized networking stack is longer than the bare-metal networking stack, which further lowers the communication rate among the framework's containerized components. In addition, existing container orchestration systems provide no mechanism to dynamically adjust the resources occupied by each containerized component during training, which leads to low compute utilization of the framework and longer completion time of the deep learning job.

To address these problems, a Container cluster Management System for Deep Learning applications (CMS-DL) is presented. In the container scheduling stage, CMS-DL uses a two-stage scheduling policy to ensure that containerized components belonging to the same framework are scheduled onto the same node, reducing the number of cross-node network packets exchanged among a framework's components. In the container creation stage, it adopts a simplified container networking stack to shorten the packet transmission path among the framework's containerized components. In the container running stage, it dynamically adjusts the resources available to each containerized component of a framework so that the framework's compute utilization is improved. CMS-DL thus raises the network communication rate among a framework's containerized components and reduces data transmission time between them, shortening the training time of deep learning jobs.

CMS-DL is implemented on a Kubernetes cluster. Test results show that, compared with the overlay network mode of Docker containers, CMS-DL reduces the training time of a single deep learning job by 0.6%-20.1%. Compared with the Kubernetes default scheduling policy and a greedy scheduling policy, CMS-DL reduces the training time of the cluster's deep learning jobs by 42.3% and 12.3%, respectively.
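The co-location idea behind the two-stage scheduling policy can be sketched in simplified form as follows. This is an illustrative assumption, not CMS-DL's actual implementation: the function names are hypothetical, and resources are modeled as CPU cores only. Stage one tries to fit all of a job's components onto a single node; stage two falls back to spreading them only when no single node has enough free capacity.

```python
# Hypothetical sketch of a two-stage, co-location-first scheduling policy.
# Stage 1: find one node that can host ALL components of a job together,
#          avoiding cross-node traffic among the framework's components.
# Stage 2: fall back to spreading components across nodes only when no
#          single node has enough free capacity.

def schedule_job(job_components, nodes):
    """job_components: list of per-component resource demands (CPU cores).
    nodes: dict node_name -> free CPU cores.
    Returns dict component_index -> node_name, or None if unschedulable."""
    total = sum(job_components)

    # Stage 1: co-locate the whole job on the smallest node that fits it.
    for name, free in sorted(nodes.items(), key=lambda kv: kv[1]):
        if free >= total:
            return {i: name for i in range(len(job_components))}

    # Stage 2: greedy fallback, placing the largest components first
    # on whichever node currently has the most free capacity.
    placement = {}
    free_map = dict(nodes)
    for i, demand in sorted(enumerate(job_components), key=lambda x: -x[1]):
        name = max(free_map, key=free_map.get)
        if free_map[name] < demand:
            return None  # cluster cannot host this job right now
        placement[i] = name
        free_map[name] -= demand
    return placement

nodes = {"node-a": 8, "node-b": 16}
# All three components fit together on node-b, so stage 1 succeeds:
print(schedule_job([4, 4, 4], nodes))  # {0: 'node-b', 1: 'node-b', 2: 'node-b'}
```

In a real Kubernetes cluster the same preference could be expressed declaratively with pod affinity rules keyed on a per-job label, but the abstract's point is the ordering: exhaust single-node placements before accepting any cross-node split.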
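The runtime resource-adjustment mechanism can likewise be sketched. The rebalancing rule below is a hypothetical illustration, assuming CPU quotas as the only adjustable resource and utilization sampling as the signal; CMS-DL's actual policy is not specified in the abstract. The idea is to periodically shift capacity from under-utilized components (e.g. an idle parameter server) to busy ones (e.g. a compute-bound worker).

```python
# Hypothetical sketch of dynamic resource adjustment among a framework's
# containerized components during training. Each adjustment round moves a
# small slice of CPU quota from the least-loaded to the most-loaded
# component, never shrinking any component below a safety floor.

def rebalance(quotas, utilization, step=0.1, floor=0.5):
    """quotas: dict component -> CPU cores currently allotted.
    utilization: dict component -> fraction of its quota actually used.
    Returns a new quota map after one adjustment round."""
    donor = min(quotas, key=lambda c: utilization[c])
    receiver = max(quotas, key=lambda c: utilization[c])
    if donor != receiver and quotas[donor] - step >= floor:
        quotas = dict(quotas)          # leave the caller's map untouched
        quotas[donor] -= step
        quotas[receiver] += step
    return quotas

quotas = {"ps": 2.0, "worker-0": 2.0}
usage = {"ps": 0.3, "worker-0": 0.95}
# The idle parameter server donates 0.1 cores to the busy worker:
print(rebalance(quotas, usage))
```

In practice such a quota change would be applied by updating the container's cgroup CPU limit; running the round in a control loop lets quotas track the shifting load across training phases.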