
Methods For Exploiting Stragglers In Coded Distributed Computing Systems

Posted on: 2022-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: R Z Cui
GTID: 2518306725981149
Subject: Computer Science and Technology
Abstract/Summary:
Owing to its reliability, scalability, and high computing speed, distributed computing has become a common way to perform large-scale computing tasks. However, a distributed system may contain straggler nodes, which increase the total time required to execute a task and thus limit performance. Coded distributed computing is a new paradigm that uses coding to create storage or computation redundancy, reducing the impact of unpredictable failed or straggling nodes. However, most existing coding schemes for the master-worker computing framework recover the output from the results of only a fixed number of the fastest worker nodes and completely ignore the work done by the other nodes, which lowers performance. To address this problem, this thesis considers both the single-master scenario and the multi-master scenario and proposes a coding scheme for each that uses the working capacity of all nodes in the system to improve efficiency. The specific contributions are as follows:

(1) To address the underutilization of worker node resources in a distributed system with a single master node, this thesis introduces two new concepts, "communication at full speed" and "computation at full speed", which mean, respectively, that all communication links between the worker nodes and the master node are fully utilized, and that all computation completed by each worker node is fully used by the master node. Based on the polynomial coding framework, we propose a randomization method in which each worker node divides its local computation result into several blocks, then generates new result blocks by coding them in turn and forwards them. The feasibility of this idea is proved theoretically. By mapping the encoding operation of the randomization method onto the encoding of the input data set, we further prove that computation at full speed can be achieved in some typical task scenarios. Experimental results, together with simulations based on a real environment, show that the new method makes use of the straggler nodes in a single-master master-worker system, significantly reducing task completion time and improving system resource utilization.

(2) To address straggling master nodes in a distributed system with multiple master nodes, this thesis proposes a new computing framework that includes multiple master nodes. The master nodes cooperate to aggregate the computation results without any additional management node. We propose a coding scheme based on MDS codes so that the distributed system can tolerate stragglers among the master nodes. By introducing the random coding scheme from the single-master scenario, the multi-master scenario can also make full use of the working capacity of the straggler nodes. Simulation results show that the new framework and coding scheme effectively relieve the master-node bottleneck and the straggler problem in distributed computing while making full use of the straggler nodes, thereby improving system efficiency.
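The general idea of using partial work from stragglers can be illustrated with a small sketch. It is not the thesis's randomization method: it is a simplified, assumed setup in which the rows of a matrix `A` are encoded with a Vandermonde (MDS-style) code, each worker holds a few coded rows, and the master decodes `A·v` as soon as any `k` coded row results have arrived, regardless of which workers sent them. All names here (`encode_rows`, `decode`, the worker layout, the evaluation points `i + 1`) are hypothetical choices made for the illustration.

```python
from fractions import Fraction

def encode_rows(A, n_coded):
    """Coded row i is sum_j A[j] * x_i**j with x_i = i + 1 (assumed evaluation points)."""
    k, cols = len(A), len(A[0])
    coded = []
    for i in range(n_coded):
        x = Fraction(i + 1)
        coded.append([sum(A[j][c] * x**j for j in range(k)) for c in range(cols)])
    return coded

def dot(row, v):
    return sum(a * b for a, b in zip(row, v))

def solve(M, b):
    """Gauss-Jordan elimination over exact fractions; M must be square and invertible."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

def decode(received, k):
    """received: list of (coded_row_index, value). Uses the first k entries."""
    picked = received[:k]
    M = [[Fraction(i + 1) ** j for j in range(k)] for i, _ in picked]
    return solve(M, [val for _, val in picked])

# Toy instance: k = 4 data rows, 8 coded rows, 4 workers holding 2 coded rows each.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [2, 0, 1]]
v = [1, -1, 2]
coded = encode_rows(A, n_coded=8)

# Worker 0 finished both rows (0, 1); workers 1 and 2 are stragglers that
# finished only their first row (2 and 4); worker 3 returned nothing.
finished = [0, 1, 2, 4]
received = [(i, dot(coded[i], v)) for i in finished]

decoded = decode(received, k=4)
expected = [dot(row, v) for row in A]
```

In this toy run only one worker completed its whole assignment, yet `decoded` equals `expected`, because the two partial results from the stragglers supply the remaining equations. A scheme that waits for a fixed number of fully finished workers would still be waiting at this point.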
Keywords/Search Tags:distributed computing, coded computing, polynomial-based coding, straggler tolerance, multi-master architecture