Emerging information technology industries such as cloud computing, big data, and artificial intelligence are developing rapidly, and the demand for computing power is growing ever more urgent. However, as high-performance computing systems are scaled up to deliver more computing power, their interconnection networks also become more complex, so the communication overhead between computing nodes becomes one of the main bottlenecks of these systems. For example, research on MPI, the most commonly used parallel programming model for high-performance computing, has shown that MPI collective communication has a great impact on the performance of high-performance computing applications: according to some studies, about 70% of the time is spent on collective communication, and the communication overhead increases with system scale. Optimizing communication effectively is therefore an important challenge. In the past, MPI collective communication was optimized mainly in software, but this approach is bounded by the upper limit of algorithmic improvement and cannot keep improving communication performance. With the development of hardware devices and the corresponding programmable languages, in-network computing has advanced rapidly. In-network computing uses intelligent network hardware such as programmable switches and SmartNICs to offload computation to network devices. Data is processed while it is in transit, which makes efficient use of the high forwarding capacity of network devices and, at the same time, reduces communication overhead and CPU load. Using in-network computing to extend the functions of network communication devices, offload the computation in scientific computing tasks from the CPU to specific network devices (such as programmable switches), and thereby further optimize communication has become a research hotspot in high-performance computing. This paper first analyzes the MPI communication characteristics of scientific applications commonly run at the supercomputing center of the University of Science and Technology of China and finds that Allreduce collective communication accounts for more than 50% of the communication overhead of these applications. To address this large collective communication overhead, this paper proposes a method based on in-network computing that optimizes MPI collective communication and reduces communication overhead. Using the Ethernet RoCE protocol, a programmable switch, and an MPI library extended with an in-network computing module, the method offloads the computation in MPI collective communication to the programmable switch, so that computation is performed while the data is being transmitted, reducing both communication latency and the computing load on server nodes. For this method, the paper designs and implements two in-network computing optimization modes: one for general scenarios, and another that further improves communication performance in scenarios where the server has multiple physical CPUs and load balancing. Collective communication benchmarks and application tests were conducted under both optimization modes on the Hanhai 20 supercomputing system. The benchmark results show that, compared with host-based communication, the in-network computing scheme achieves a speedup of up to 2.4 for Allreduce and up to 3.1 for Barrier. The application test results show that, compared with host-based communication, the scheme achieves a maximum speedup of 1.14 on 16 nodes. Both sets of results demonstrate the effectiveness of the proposed optimization method and provide a reference for subsequent research in this field. The code for this paper has been open-sourced on GitHub.
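
To make the offloading idea concrete, the following minimal Python sketch contrasts the result of a host-side Allreduce with a toy model of a switch that aggregates contributions as they arrive and then sends the result back to every node. It is only an illustration of Allreduce semantics, not the implementation described in this paper: the names `ToySwitch`, `host_based_allreduce`, and `in_network_allreduce` are hypothetical, and the sketch does not model RoCE, packet formats, or a real programmable switch.

```python
from typing import List


def host_based_allreduce(node_data: List[List[float]]) -> List[List[float]]:
    """Reference semantics: every node ends up with the element-wise sum."""
    total = [sum(column) for column in zip(*node_data)]
    return [list(total) for _ in node_data]


class ToySwitch:
    """Hypothetical aggregating switch: sums payloads as they pass through."""

    def __init__(self, num_nodes: int, length: int) -> None:
        self.num_nodes = num_nodes
        self.acc = [0.0] * length   # per-element accumulator held in the switch
        self.seen = 0               # number of contributions received so far

    def receive(self, payload: List[float]) -> None:
        # Reduce on the fly instead of forwarding the payload to a host.
        for i, value in enumerate(payload):
            self.acc[i] += value
        self.seen += 1

    def multicast(self) -> List[List[float]]:
        # The result is released only after every node has contributed.
        if self.seen != self.num_nodes:
            raise RuntimeError("not all contributions have arrived")
        return [list(self.acc) for _ in range(self.num_nodes)]


def in_network_allreduce(node_data: List[List[float]]) -> List[List[float]]:
    """Each node sends its vector once; the switch does the reduction."""
    switch = ToySwitch(len(node_data), len(node_data[0]))
    for payload in node_data:
        switch.receive(payload)
    return switch.multicast()
```

In this toy model each node sends its vector once and receives the reduced vector once, and no host CPU performs the summation, which is the property that the in-network approach aims to exploit.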