Font Size: a A A

A Simple Strategies For Path-Based Multicasting In Wormhole-Routed Meshes

Posted on:2010-04-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:AMNEH AHMAD ABDELRAHMAN OBEIDFull Text:PDF
GTID:1118360272496720Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Communications among processors of a distributed memory parallel computer are often the main causes of performance degradations. Indeed, in most systems"the elementary communication time"is still larger than the"elementary computation time". Therefore it is of a main interest to focus on the development of efficient communication strategies for distributed memory multiprocessors. From a single source, three kinds of communication problems can be distinguished. Unicast, Multicast and Broadcast.1- Unicast operations. This corresponds to one-to-one communication in which one source sends the same message to only one destination node (Point-to-Point). 2- Multicast operations. The essential pattern in new multicomputer generations is the multicast wormhole pattern, which corresponds to one-to-many communication in which one source sends the same message to multiple destination nodes. Existing multicast routing algorithms can be classified as unicast-based, path-based and tree-based. In unicast-based multicast algorithms, a source node sends messages to its destinations by sending a sequence of separate messages to each destination. This means that only one source and one destination (point-to-point). The disadvantage of the unicast-based approach is the large number of startups required to send a message to a large set of destination. In path-based facility header consists of a list of destination addresses that must be reached in the specified order. More precisely, a header consists of an ordered sequence of addresses @(v1) @(v2)…@(Vk) meaning that the message must go first to v1, next to v2 and so on. When the flits of a message reach an intermediate destination vi, the address @(vi) is removed from the header, and they can be copied to the local memory while they continues in order to reach the next destination specified by the header, namely @(vi+1). A message is removed from the network when it reaches its last destination. In this way, a message can be delivered to several destinations with the same startup latency as a message sent to a single destination. In tree-based routing, a multidestination message is routed through the network at each intermediate node along a multidestination message's path using all header flits in the message; no ordering of the destinations is required before the message is injected into the network.3- Broadcast operations. This corresponds to one-to-all communication in which one source sends the same message to all possible destinations in the network. In this thesis, we study the second class of problems. The multicast problem has been studied under both store and forward routing and wormhole routing. In this thesis, we focus on the multicast problem under wormhole routing.Wormhole routing. One of the efficient message routing algorithms of multicomputers is wormhole-routed. Wormhole switching has been widely used in practice due firstly to its low buffering requirements, allowing for efficient router implementation. Secondly, and more importantly, it makes latency almost independent of the message distance in the absence of blocking. In wormhole-routed networks, packets are divided into flits. A flit is the smallest unit of information that a channel can accept or refuse. Wormhole routing operates by advancing the head of a packet directly from incoming to outgoing channels. The transmission from the source node to the destination node is done through a sequence of routers. All flits in the same packet are transmitted in order as pipelined fashion. Only the header flit knows where the packet is going, and the remaining data flits must follow the header flit. Once the header flit gains access to a channel, the current message owns that channel until the tail flit passes through it and resigns ownership of the channel. If the header encounters a channel already in use, it is blocked until the channel is freed. In wormhole routing, if the required output channel is busy, the message is blocked in place.Wormhole switching technique has been a popular switching technique in new generation multicomputer networks. The first commercial multicomputer to adopt wormhole routing was Ametek 2010, which used a 2D mesh topology. The Ncube-2, which uses a hypercube topology, has also adopted wormhole routing. The Intel Touchstone Delta and Inter Paragon use wormhole routing in a 2D mesh. Finally, MIT's research prototype J-machine uses wormhole routing in a 3D mesh.In wormhole routing, contiguous flits in a packet are always contained in the same or adjacent nodes of the network. This can cause difficulties, as possibility of deadlock arisesDeadlocks. A multicomputer network is said to be in a deadlock condition when no message can advance towards its destination. This situation can postpone message delivery indefinitely. Deadlock can occur if messages are allowed to hold some resource while requesting others.Deadlock is a catastrophic to a network. After a few resources are occupied by deadlocked packets, other packets block on these resources, paralyzing network operation. There are three strategies to prevent this situation: deadlock prevention, deadlock avoidance, and deadlock recovery. In the deadlock prevention, resources are granted to a packet in such a way that a request never leads to a deadlock. It can be achieved by reserving all the required resources before starting packet transmission. In the deadlock avoidance, resources are requested as a packet advances through the network. However, a resource is granted to a packet only if the resulting global state is safe. Finally, the deadlock recovery strategies are optimistic. Deadlock recovery strategies take no action to prohibit deadlock but detect the occurrence of deadlock and resolve the deadlock. This scheme is based on the observation that deadlock is very rare phenomenon in the real world. Almost all modern networks use deadlock avoidance.Multicast latency consists of three components, start-up latency, network latency and blocking latency. The start-up latency is the time incurred by the operating system when preparing a message for injection into the network. The network latency consists of channel propagation and router delays, while blocking latency accounts for delays due message contention over network resources, e.g. buffers and channels.In current generation machines, the start-up latency is the dominating factor in the cost of communication, being typically in the order of microseconds compared to the network latency, which is in the order of nanoseconds. Blocking latency, on the other hand, depends on the routing algorithm and the generated traffic, and consequently can vary widely depending on instantaneous traffic conditions. Therefore, our objective in this thesis is the design of efficient multicast path-based routing algorithms that can balances start-up latency with different network loads so as to achieve high parallelism and low communication latency over a range of network loads.This thesis, discussion is restricted to the 3-D mesh topology with Bi-directional channels. An m (rows) x n (columns) x r (layers) 3-D mesh comprises mnr nodes interconnected in a grid fashion. The 3-D mesh topology can be modeled as a graph M (V, E) in which each node in V (M) corresponds to a processor and each edge in E (M) corresponds to a communication channel. The mesh graph is formally defined below. Definition 3.1: An mxnxr non wraparound 3-D mesh graph is a directed graph M (V, E), where the following conditions exist:The mesh topology is asymmetric due to the absence of the wrap-around connections along each dimension. As a result, nodes may not be connected to the same number of neighbors; those at the corners, edges, and middle of the network have four and six neighbors respectively. In this system, the node consists of a processing element (PE) and router. The processing element contains a processor and some local memory. There are local channels used by the processing element to inject/eject messages to/from the network, respectively. Messages generated by the processing element are injected into the network through the injection channel. This study considers the All-Port router model where routers are able to relay multiple messages simultaneously provided that each incoming message requires a unique outgoing channel leading to a neighboring node, and that a node can simultaneously send and receive messages along all ejection and injection channels.In this thesis, four multicast path-based wormhole algorithm in general three-dimensional networks are proposed and studied. The four algorithms are shown to be deadlock-free.Our first introduced algorithm GTDBTPM is designed such that can send messages to any number of destinations within two start-up phases (in x, y directions) using binary search (in z direction); hence the name General Three-Dimension Binary Two-Phase Multicast. It is based on splitting the 3-D mesh network into set of layers. The 3-D mesh network can be shown as set of layers; each layer represents a 2-D mesh network. Each node in the mesh is represented by its integer coordinate (x, y, z). In fact, if we have mxnxr 3-D mesh, then we have r layers of mxn 2-D mesh. The z coordinate for the first layer is 0, for the second layer is 1, and for the last layer is r-1. The GTDBTPM algorithm divides network in z coordinate into two subnetworks N+z, and N-z. Subnetwork N+z contains all upper diagonal channels of source with addresses [(x, y, z), (x, y, z + 1)], and subnetwork N-z contains all lower diagonal channels of source with addresses [(x, y, z), (x, y, z - 1)].The second introduced algorithm GTDMPM is designed such that can send messages to any number of destinations within multiple start-up phases; hence the name General Tree-Dimension Multi-Phase Multicast. It based on splitting the network into six subnetworks, NxR, NxL, Nxs,yu, Nxs,yL, Nxs,ys,zu and Nxs,ys,zL. Subnetwork NxR contains all right horizontal channels of source with addresses [(x, y, z), (x + 1, y, z]; Subnetwork NxL contains all Left horizontal channels of source with addresses [(x, y, z), (x - 1, y, z)]; Subnetwork Nxs,yu contains all upper vertical channels of source with addresses [(x, y, z), (x, y + 1, z]; Subnetworks Nxs,yL contains all Lower vertical channels of source with addresses [(x, y, z), (x, y - 1, z]; Subnetworks Nxs,ys,zu contains all upper diagonal channels of source with addresses [(x, y, z), (x, y, z + 1] and Subnetworks Nxs,ys,zL contains all lower diagonal channels of source with addresses [(x, y, z), (x, y, z - 1].Our third introduced algorithm GTDTPM is designed such that can send messages to any number of destinations within two start-up phases; hence the name General Three-Dimension Two-Phase Multicast. It exploits the features of Hamiltonian paths to implement multicast in two message-passing steps, thus considerably reducing the effects of both network size and start-up latency. At the source node, GTDTPM algorithm divides the network into two subnetworks, NU and NL, where every node in NU has a higher label than that of the source node and every node in NL has a lower label than that of the source.The fourth introduced algorithm GTDSPM is designed such that can send messages to any number of destinations within six start-up phases; hence the name General Three-Dimension Six-Phase Multicast. It exploits the features of Hamiltonian paths to implement multicast in six message-passing steps. In a 3-D mesh, most nodes have outgoing degree 6 so up to six paths can be used to deliver a message, depending on the locations of the destinations relative to the source node. The only difference between GTDSPM algorithm and GTDTPM algorithm concerns message preparation at the source node, in which the destination sets DU and DL of the GTDTPM algorithm are further partitioned. The set DU is divided into three subsets, DU1 containing the nodes whose x coordinates are greater than to that of source, DU2 containing the nodes whose x coordinates are smaller than to that of source and the DU3 containing the remaining nodes in DU. The set DL is partitioned in a similar manner. The message is sent to the six sets simultaneously through the six output ports of source.A simulation study has been conducted that compares the performance of these multicast algorithms under dynamic network traffic conditions in a 3-D mesh. To evaluate the performance of the multicast schemes in an interconnection network, there are some parameters that must be considered: The injection rate, the multicast size, the message length, and the startup latency. The injection rate is the average interarrival time, the multicast size is the number of destination nodes, and the message length f is the number of flits in a message. The message startup latency ? includes the software overhead for buffers allocating, messages coping, router initializing, etc.We first give our assumptions to the parameters of system architecture in the simulations. All simulations were performed for a 6x6x6 3-D mesh. We examined the routing performance of our proposed schemes under various injection rate, multicast sizes, startup latencies, and message lengths. The source node and the destination nodes for each multicasting were randomly generated. The large message startup latency ? is set to be 100 ms, and the small message startup latency ? is 10 ms. The small message startup latencies were usually used for advanced network interface to improve the efficiency of latency time. For all of the multicasting, the message sizes of 1, 100, 1000 and 10000 flits were simulated.To compare the performance of our proposed multicast routing algorithms, the simulation program used to model multicast communication in a single channels 3-D mesh networks is written in VC++ and uses an event-driven simulation package, CSIM. CSIM allows multiple processes to execute in a quasiparallel fashion and provides a very convenient interface for writing modular simulation programs. The simulation program for multicast communication is part of a larger simulator, called MultiSim, which is designed to study large-scale multiprocessors. MultiSim consists of several components, all of which run within the CSIM package. This section describes the program and results obtained from it. All simulations were executed until the confidence interval was smaller than 5% of the mean, using 95% confidence intervals.The behaviors of these algorithms were compared using simulation. The results presented in this thesis indicate that when the ratio ? of the startup time over the propagation time is small, and when the message size is very small (1 flit), the GTDSPM algorithm performs better than the other algorithms when the injection rate increases. When the message size is medium (100 flits) both GTDTPM and GTDSPM algorithms perform better than the GTDMPM and GTDBTPM algorithms at high injection rate (when the load is greater than 100), then the GTDMPM and GTDBTPM algorithms turn to be more efficient for a small number of loads (the load is smaller than 100). When the message size increases however, the GTDMPM algorithm is best. These conclusions do not change too much when the startup time increases (100 times the propagation time).In fact, when experimental results on the average network latency is function of the number of destinations, (the message size and not the number of destinations) is the real cause of the performance degradation for the GTDTPM algorithm. The GTDTPM algorithm outperforms the other algorithms when the message size is medium (100 flits). When the message size is big (1000 flits), GTDMPM algorithm outperforms the other algorithms. These conclusions do not change too much when the startup time increases (100 times the propagation time).In conclusion, there is not a unique candidate to be the best-multicast algorithms on meshes, and the choice depends on many parameters. The final choice depends on traffic conditions, and network characteristics. From our results, we would suggest using GTDTPM algorithm when the message size is small to medium (1 to 100 flits), and using GTDMPM when message size is big (greater than 100 flits).
Keywords/Search Tags:Wormhole-Routed
PDF Full Text Request
Related items