Hybrid Optical And Electrical Interconnection Networks For Distributed Machine Learning

Posted on: 2023-11-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Lu
GTID: 1528306905997029
Subject: Communication and Information System
Abstract/Summary:
Driven by advances in big data processing and supercomputing, as well as by economic and social development, artificial intelligence has developed rapidly in recent years and has brought disruptive technological change to many fields. Emerging applications such as machine learning place higher demands on interconnection networks. Ultra-large-scale training data and training models pose new challenges to the computing power and storage capacity of devices, so interconnection networks must be more scalable to accommodate ever-expanding training data and neural network models. The diversity of model splitting methods, parameter synchronization algorithms, and training models also demands greater flexibility: the interconnection network should be able to adapt to the communication demands of different applications. Traditional electrical networks struggle to provide this scalability and flexibility.

Optical interconnects have clear advantages in network bandwidth, switching capacity, system cost, and energy consumption. In recent years researchers have proposed many network architectures based on optical interconnect technologies, including Mordia, OSA, Quartz, Flexfly, RotorNet, and Flat-tree. Although these architectures provide ideas for applying optical interconnects in data centers and high-performance computing, they remain far from large-scale deployment and application. Current optical interconnects are based on either wired communication or free space optics. Those based on wired communication mostly use customized optical switching equipment to achieve a higher configuration speed and thereby respond actively to the communication requirements between servers. However, customized optical switching equipment has a limited port count and is difficult to produce and deploy on a large scale, which limits the scalability and achievability of the network. Optical interconnects based on free space optics can use more bandwidth and higher frequencies, and can reconfigure the connections between nodes without the limitation of distance. Although each node can obtain a higher fan-out by adding free space optics terminals, this approach is highly susceptible to environmental influences. In addition, existing architectures incur high control-plane overhead in traffic collection, topology calculation, and optical switch configuration, so the network cannot respond quickly to changes in communication requirements.

The main work and research achievements of this thesis are as follows:

1. Existing optical interconnects are comprehensively reviewed, and their scalability and flexibility are qualitatively compared in terms of the communication mode, control mode, and switching equipment of each network structure. The traffic patterns and performance indicators used in each simulation platform, as well as the device parameters of the experimental platform, are also described, providing a reference for analyzing network performance and verifying the feasibility of the schemes in the subsequent research work. Finally, future trends and challenges of optical interconnects are presented, covering optical transmission methods, optical switching equipment, and optical control technologies.

2. To meet the scalability requirements of distributed machine learning networks and to address the bandwidth gap between in-rack and out-of-rack communication, we propose Lotus, a hybrid optical and electrical interconnection network with a fixed topology. It combines the wavelength routing and fast switching characteristics of the AWGR and is suitable for several classical synchronization algorithms. In Lotus, a complete bipartite graph is used within each group to improve bandwidth and network scalability, and AWGRs are used between adjacent groups to improve path diversity and network reliability. In addition, the routing algorithm proposed for Lotus improves the utilization of global links between groups and avoids the increase in end-to-end communication latency caused by detours. The fixed wavelength assignment is simpler and is suitable for communication in large-scale interconnection networks. The simulation experiments reproduce the communication characteristics of different synchronization algorithms. Compared with 3D-Torus and Dragonfly networks, Lotus adapts well to these classical synchronization algorithms.

3. Distributed machine learning networks require flexibility, and traditional electrical networks struggle to match diverse neural network models, scheduling strategies, and communication granularities. To address this, we propose X-NEST, a hybrid optical and electrical interconnection network with MEMS switches. It uses the flexibility of MEMS switches to allocate link resources on demand and can quickly adjust the network topology according to traffic changes. To keep the network in a working state, the MEMS switches in different parts of the network are reconfigured alternately. The simulation results show that the performance of X-NEST is close to, or even better than, that of Fat-tree under traffic patterns simulating various synchronization algorithms.

4. To address the lack of a control plane in commercial optical switches and the excessive control-plane cost in existing research, we propose a fast control plane for large-scale optical interconnects that optimizes traffic collection, topology calculation, and optical switch configuration. For traffic collection, transmitting traffic information over UDP effectively reduces the overhead, and the control plane includes a corresponding anti-loss mechanism. For topology calculation, the configuration problem of multiple MEMS switches is decomposed into a combination of several fixed MEMS configurations, which effectively reduces the complexity. When the network is reconfigured, OSPF senses the topology changes and chooses the best communication path. This approach not only reduces the burden on the control plane but also contributes to the widespread application of optical switching technology in traditional electrical networks. For optical switch configuration, the configurations pre-stored on the Polaris optical switch simplify the configuration commands and speed up the configuration process.
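
The wavelength routing and fixed wavelength assignment mentioned in the second contribution above can be illustrated with a minimal sketch. It assumes the textbook cyclic property of an N x N AWGR under one common convention: light entering input port i on wavelength index k leaves on output port (i + k) mod N. The abstract does not state the wavelength plan actually used in Lotus, so the function names, port numbering, and convention below are hypothetical and for illustration only.

```python
from typing import List


def awgr_output_port(input_port: int, wavelength: int, n_ports: int) -> int:
    """Output port reached from input_port on a given wavelength index.

    Assumed convention (not taken from the thesis): cyclic routing,
    output = (input + wavelength) mod N.
    """
    return (input_port + wavelength) % n_ports


def wavelength_for(input_port: int, output_port: int, n_ports: int) -> int:
    """Fixed wavelength index that carries traffic from input_port to output_port."""
    return (output_port - input_port) % n_ports


def wavelength_table(n_ports: int) -> List[List[int]]:
    """Static assignment table: table[i][j] is the wavelength used from port i to port j."""
    return [
        [wavelength_for(i, j, n_ports) for j in range(n_ports)]
        for i in range(n_ports)
    ]


if __name__ == "__main__":
    # Print the static table for a hypothetical 4-port AWGR, one row per input port.
    for row in wavelength_table(4):
        print(row)
```

Under this convention the table is a Latin square: every output port receives each input on a distinct wavelength, so the assignment can be computed once and never changed, which is one way to read the claim that a fixed wavelength assignment is simpler at large scale.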
Keywords/Search Tags:artificial intelligence, distributed machine learning, interconnection network, optical interconnects, control plane