With the rise of artificial intelligence, many deep learning jobs run in data centers. Container technology reduces the difficulty of deploying multiple deep learning jobs for data center administrators, so an increasing number of deep learning jobs are submitted to data centers in containerized form. The training process of a deep learning job relies on a distributed machine learning framework. Because the framework's components exchange data frequently, the network communication rate can easily become the performance bottleneck of training. However, the scheduling policies of existing container orchestration systems do not consider the performance gap between cross-node network transmission and intra-node communication among a framework's components. These policies therefore tend to place a framework's containerized components on different cluster nodes, which lengthens data transmission among the components and slows down training. Meanwhile, a container's virtualized networking stack is longer than the bare-metal networking stack, which further lowers the communication rate among the framework's containerized components. In addition, existing container orchestration systems provide no mechanism to dynamically adjust the resources occupied by each containerized component during training, which leads to low compute utilization of the framework and longer completion time of the deep learning job.

To address these problems, a Container cluster Management System for Deep Learning applications (CMS-DL) is presented. In the container scheduling stage, CMS-DL uses a two-stage scheduling policy to ensure that containerized components belonging to the same framework are scheduled onto the same node, reducing the number of cross-node network packets exchanged among a framework's components. In the container creation stage, it adopts a simplified container networking stack to shorten the packet transmission path among the framework's containerized components. In the container running stage, it dynamically adjusts the resources available to each containerized component of a framework so that the framework's compute utilization is improved. CMS-DL thus raises the network communication rate among a framework's containerized components and reduces data transmission time between them, shortening the training time of deep learning jobs.

CMS-DL is implemented on a Kubernetes cluster. Test results show that, compared with the overlay network mode of Docker containers, CMS-DL reduces the training time of a single deep learning job by 0.6%-20.1%. Compared with the Kubernetes default scheduling policy and a greedy scheduling policy, CMS-DL reduces the training time of the cluster's deep learning jobs by 42.3% and 12.3%, respectively.
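The co-location idea behind the two-stage scheduling policy can be sketched in simplified form as follows. This is an illustrative assumption, not CMS-DL's actual implementation: the function names are hypothetical, and resources are modeled as CPU cores only. Stage one tries to fit all of a job's components onto a single node; stage two falls back to spreading them only when no single node has enough free capacity.

```python
# Hypothetical sketch of a two-stage, co-location-first scheduling policy.
# Stage 1: find one node that can host ALL components of a job together,
#          avoiding cross-node traffic among the framework's components.
# Stage 2: fall back to spreading components across nodes only when no
#          single node has enough free capacity.

def schedule_job(job_components, nodes):
    """job_components: list of per-component resource demands (CPU cores).
    nodes: dict node_name -> free CPU cores.
    Returns dict component_index -> node_name, or None if unschedulable."""
    total = sum(job_components)

    # Stage 1: co-locate the whole job on the smallest node that fits it.
    for name, free in sorted(nodes.items(), key=lambda kv: kv[1]):
        if free >= total:
            return {i: name for i in range(len(job_components))}

    # Stage 2: greedy fallback, placing the largest components first
    # on whichever node currently has the most free capacity.
    placement = {}
    free_map = dict(nodes)
    for i, demand in sorted(enumerate(job_components), key=lambda x: -x[1]):
        name = max(free_map, key=free_map.get)
        if free_map[name] < demand:
            return None  # cluster cannot host this job right now
        placement[i] = name
        free_map[name] -= demand
    return placement

nodes = {"node-a": 8, "node-b": 16}
# All three components fit together on node-b, so stage 1 succeeds:
print(schedule_job([4, 4, 4], nodes))  # {0: 'node-b', 1: 'node-b', 2: 'node-b'}
```

In a real Kubernetes cluster the same preference could be expressed declaratively with pod affinity rules keyed on a per-job label, but the abstract's point is the ordering: exhaust single-node placements before accepting any cross-node split.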
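The runtime resource-adjustment mechanism can likewise be sketched. The rebalancing rule below is a hypothetical illustration, assuming CPU quotas as the only adjustable resource and utilization sampling as the signal; CMS-DL's actual policy is not specified in the abstract. The idea is to periodically shift capacity from under-utilized components (e.g. an idle parameter server) to busy ones (e.g. a compute-bound worker).

```python
# Hypothetical sketch of dynamic resource adjustment among a framework's
# containerized components during training. Each adjustment round moves a
# small slice of CPU quota from the least-loaded to the most-loaded
# component, never shrinking any component below a safety floor.

def rebalance(quotas, utilization, step=0.1, floor=0.5):
    """quotas: dict component -> CPU cores currently allotted.
    utilization: dict component -> fraction of its quota actually used.
    Returns a new quota map after one adjustment round."""
    donor = min(quotas, key=lambda c: utilization[c])
    receiver = max(quotas, key=lambda c: utilization[c])
    if donor != receiver and quotas[donor] - step >= floor:
        quotas = dict(quotas)          # leave the caller's map untouched
        quotas[donor] -= step
        quotas[receiver] += step
    return quotas

quotas = {"ps": 2.0, "worker-0": 2.0}
usage = {"ps": 0.3, "worker-0": 0.95}
# The idle parameter server donates 0.1 cores to the busy worker:
print(rebalance(quotas, usage))
```

In practice such a quota change would be applied by updating the container's cgroup CPU limit; running the round in a control loop lets quotas track the shifting load across training phases.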