Machine learning, as an important direction in artificial intelligence, plays an increasingly important role today. As machine learning algorithms solve more and more problems, people also face new challenges: large-scale data, large models, and large amounts of computation. Because such problems are infeasible on a single machine, it is natural to use multiple high-performance computers to accelerate model training. In most cases, however, multiple machines cannot be used for training directly. From the perspective of machine learning theory, most algorithms can be formalized as finding the extremal value of an objective function composed of a loss function and a regularization term. The most common way to solve this problem is with an optimization algorithm, and for first-order methods the most widely used is stochastic gradient descent (SGD). We therefore believe that studying the parallelization of stochastic gradient descent can yield significant benefits.

Principal component analysis (PCA) and singular value decomposition (SVD) are widely used across many fields of machine learning. In practical applications, the time spent solving a PCA subproblem within the overall problem is often substantial. We propose a fast distributed principal component analysis algorithm based on stochastic gradient descent with variance reduction. We use random sampling to update the target and adopt delayed synchronization as our synchronization mechanism.

In recent years, owing to the outstanding performance of deep neural networks in many areas, they have gained widespread attention as a branch of machine learning. In distributed deep learning, a common practice is to use distributed training so that massive amounts of data can be processed in parallel across multiple machines and multiple cards in a heterogeneous
cluster with CPUs and GPUs, greatly increasing the convergence speed. A detailed breakdown of a distributed deep learning task into its phases reveals that cross-machine communication is often the bottleneck, and it is the part that most needs optimization in large-scale deep learning tasks. On the one hand, from the perspective of algorithm research, we analyze the bottlenecks of existing algorithms in detail and propose a communication strategy based on two-step reduction to reduce the time spent aggregating gradients across machines. Building on this, we propose a distributed gradient descent method based on two-step reduction for large-scale deep neural networks. On the other hand, from the perspective of engineering applications, we analyze the differences and connections between RDMA over InfiniBand and TCP/IP over Ethernet, as well as the shortcomings of socket-based communication. We design and implement an efficient communication interface for distributed deep learning based on InfiniBand's native standard library. We use different communication modes depending on the size and purpose of data packets, using RDMA to achieve high throughput and low CPU overhead. Our communication interface is also implemented asynchronously to maximize the performance advantages of RDMA.
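The abstract's variance-reduced stochastic PCA can be illustrated with a minimal single-machine sketch. This is not the thesis's actual algorithm; it follows the well-known SVRG-style scheme for the top principal component, where each epoch anchors a full-batch gradient and the inner loop applies variance-reduced stochastic updates. All parameters (`eta`, `epochs`) are illustrative assumptions.

```python
import numpy as np

def vr_pca(X, eta=0.01, epochs=20):
    """Sketch: top principal component of X (n x d) via variance-reduced SGD.
    Hypothetical hyperparameters; the thesis's algorithm may differ."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        w_snap = w.copy()
        # full-batch "anchor" gradient, recomputed once per epoch
        u = X.T @ (X @ w_snap) / n
        for _ in range(n):
            i = rng.integers(n)
            x = X[i]
            # variance-reduced stochastic update (Oja-style):
            # equals u exactly when w == w_snap, so its variance
            # shrinks as w approaches the snapshot
            g = x * (x @ w) - x * (x @ w_snap) + u
            w = w + eta * g
            w /= np.linalg.norm(w)
    return w
```

In a distributed setting each worker would run the inner loop on its own data shard and, per the abstract, exchange iterates under a delayed-synchronization scheme rather than after every update.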
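The two-step reduction strategy can be sketched as a hierarchical reduce: gradients are first summed within each node (fast local memory or intra-node links), and only one tensor per node then crosses the network. This toy simulation assumes that interpretation; the thesis's actual protocol and topology may differ.

```python
import numpy as np

def two_step_allreduce(grads, workers_per_node):
    """Simulate a two-step (hierarchical) gradient reduction.

    grads: one gradient array per worker, grouped so that consecutive
    workers_per_node entries live on the same machine.
    """
    # Step 1: intra-node reduction -- no network traffic.
    node_sums = [sum(grads[i:i + workers_per_node])
                 for i in range(0, len(grads), workers_per_node)]
    # Step 2: inter-node reduction -- only len(node_sums) tensors
    # cross the network instead of len(grads).
    total = sum(node_sums)
    return total / len(grads)  # averaged gradient, broadcast back
```

The point of the two steps is that cross-machine traffic shrinks from one tensor per worker to one per node, which directly targets the communication bottleneck identified in the abstract.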
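The size-dependent, asynchronous communication interface can be mimicked in a toy form: a threshold picks an "eager" path for small packets versus an RDMA-write-style path for large ones, and sends return a future immediately, loosely mirroring RDMA completion queues. The threshold value, class name, and mode labels are all assumptions for illustration; real RDMA code would use the InfiniBand verbs API, not Python threads.

```python
import concurrent.futures

EAGER_LIMIT = 32 * 1024  # bytes; hypothetical mode-switch threshold

def choose_mode(nbytes):
    # Small packets: eager send (copied into a pre-posted buffer).
    # Large packets: rendezvous + RDMA write (zero-copy, low CPU overhead).
    return "eager" if nbytes <= EAGER_LIMIT else "rdma_write"

class AsyncChannel:
    """Toy stand-in for an asynchronous RDMA-style interface:
    send() returns immediately with a future; the caller polls or
    waits for completion later, as with an RDMA completion queue."""

    def __init__(self):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

    def send(self, payload: bytes):
        mode = choose_mode(len(payload))
        # In real code this would post a work request; here we just
        # report which path was taken and how many bytes were "sent".
        return self._pool.submit(lambda: (mode, len(payload)))
```

The asynchronous return is what lets computation overlap with communication, which is where the abstract claims RDMA's performance advantage is realized.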