With the continuous development of Internet technology, the amount of data is growing exponentially. Big data has driven data-parallel distributed model training, and reducing the enormous communication cost of large-scale distributed training is currently a hot research topic. Some existing efforts attempt to reduce communication cost through gradient compression and model pruning. Since the emergence of programmable switches with in-network computing capability, such as the Intel Tofino switch, a method called in-network aggregation has successfully reduced the communication cost of the parameter aggregation stage and alleviated bandwidth bottlenecks in clusters. However, the parameter update stage is equally important in distributed model training, and existing solutions do not consider reducing its communication cost. To address this issue, we make two main contributions.

First, we exploit the in-network computing capability of programmable switches to construct multicast trees for parameter updates in distributed model training. Under the resource constraints of programmable switches, we propose a parameter update optimization algorithm called INMNT, based on randomized rounding, which reduces redundant data transmission during parameter updates and thereby lowers cluster bandwidth consumption. The algorithm achieves an approximation ratio of O(log |V|), where |V| is the number of programmable switches. We conducted small-scale experiments on programmable switches, demonstrating that our algorithm accelerates communication in distributed model training. To further validate its performance, we also conducted large-scale simulation experiments; the results show that our solution reduces downstream communication cost by 14.5% to 35.8% compared with the state-of-the-art solution.

Second, since data centers often run a massive number of tasks, building a multicast tree for every parameter update session may cause resource constraint violations in the network. We therefore investigate the multicast and reconfiguration problems for parameter update sessions under capacity constraints on links and programmable switches in a multi-task scenario, with the goal of maximizing training throughput. We first construct multiple candidate multicast trees for each session based on different resource utilization scenarios, and then optimize multicast tree placement in the multi-task setting, considering offline and online scenarios as complementary cases. For the offline scenario, we propose an efficient algorithm called KMMT with an approximation ratio of O(1/(log n/α + 1)), where α ≥ 1 is a function of the minimum link/programmable-switch capacity and the maximum multicast demand, and n is the number of devices in the network. For the online scenario, we develop a method called OMSM with a competitive ratio of O(log n/α + 1), which balances network throughput against the continuity of existing traffic. Additionally, we devise a multicast reconfiguration method that prevents congestion during reconfiguration and retains the same asymptotic throughput guarantee as KMMT. Extensive simulations show that our algorithms achieve significant performance improvements over previous multicast algorithms.
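
To make the randomized-rounding idea behind INMNT more concrete, the following minimal Python sketch illustrates the general technique under simplified assumptions: a fractional LP solution is rounded by sampling each candidate link with probability equal to its LP value, and the sampling is repeated for several rounds, which is what yields the logarithmic approximation factor in standard analyses of this technique. The topology, LP values, and function names here are hypothetical and do not come from the paper; the actual INMNT algorithm additionally enforces programmable-switch resource constraints.

    import random

    # Hypothetical fractional LP solution: for each candidate link, the LP
    # assigns a value in [0, 1] indicating how much it is "used" by the
    # multicast tree of one parameter-update session. These numbers are made
    # up for illustration; in the INMNT setting they would come from solving
    # an LP relaxation under switch-resource constraints.
    lp_solution = {
        ("ps", "sw1"): 1.0,
        ("sw1", "sw2"): 0.6,
        ("sw1", "sw3"): 0.4,
        ("sw2", "worker1"): 0.6,
        ("sw3", "worker1"): 0.4,
        ("sw2", "worker2"): 0.7,
        ("sw3", "worker2"): 0.3,
    }
    workers = {"worker1", "worker2"}

    def randomized_round(lp_solution, workers, rounds):
        """Pick each fractional link independently with probability equal to
        its LP value, repeating for several rounds so that every worker is
        reached with high probability."""
        chosen = set()
        for _ in range(rounds):
            for edge, frac in lp_solution.items():
                if random.random() < frac:
                    chosen.add(edge)
            reached = {dst for (_, dst) in chosen}
            if workers <= reached:
                break
        return chosen

    tree_edges = randomized_round(lp_solution, workers, rounds=8)
    print(sorted(tree_edges))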
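
Similarly, the online placement setting addressed by OMSM can be illustrated with a simplified greedy admission skeleton: sessions arrive one by one, each with a bandwidth demand and a few precomputed candidate multicast trees, and a session is either placed on a feasible candidate tree or rejected while respecting residual link capacities. This is only an illustrative sketch with made-up capacities, demands, and names, not the OMSM algorithm itself, which also handles programmable-switch capacity and provides the stated competitive ratio.

    # Illustrative skeleton of online multicast-session placement under link
    # capacity constraints; all values below are hypothetical.
    link_capacity = {
        ("sw1", "sw2"): 10.0,
        ("sw1", "sw3"): 10.0,
        ("sw2", "sw4"): 6.0,
        ("sw3", "sw4"): 6.0,
    }
    residual = dict(link_capacity)

    # Each arriving session brings a bandwidth demand and a few precomputed
    # candidate multicast trees (each tree is a set of links).
    sessions = [
        ("job-A", 4.0, [{("sw1", "sw2"), ("sw2", "sw4")},
                        {("sw1", "sw3"), ("sw3", "sw4")}]),
        ("job-B", 5.0, [{("sw1", "sw2"), ("sw2", "sw4")},
                        {("sw1", "sw3"), ("sw3", "sw4")}]),
    ]

    def admit(demand, candidate_trees):
        """Greedily place the session on the feasible candidate tree that
        leaves the largest bottleneck residual capacity; reject otherwise."""
        best_tree, best_slack = None, -1.0
        for tree in candidate_trees:
            if all(residual[link] >= demand for link in tree):
                slack = min(residual[link] - demand for link in tree)
                if slack > best_slack:
                    best_tree, best_slack = tree, slack
        if best_tree is None:
            return None  # reject: every candidate would exceed some link capacity
        for link in best_tree:
            residual[link] -= demand
        return best_tree

    for sid, demand, trees in sessions:
        placed = admit(demand, trees)
        print(sid, "placed on" if placed else "rejected",
              sorted(placed) if placed else "")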