
An Innovative Agile Switching System: Contention-Tolerant Crossbar Switch

Posted on: 2011-07-01    Degree: Doctor    Type: Dissertation
Country: China    Candidate: G N Qu    Full Text: PDF
GTID: 1100360305453673    Subject: Computer system architecture

Abstract/Summary:
The growth of Internet applications creates demand for faster and more scalable switches and routers. To simplify scheduling and queue management, variable-size packets are segmented into fixed-size cells at the input ports and reassembled at the output ports. Crossbars are widely used in IP routers because of their simple and fast non-blocking routing mechanism. Depending on where packets are buffered, switches fall into four types: input-queued (IQ), output-queued (OQ), combined input- and output-queued (CIOQ), and combined input- and crosspoint-queued (CICQ). A CICQ switch, which has crosspoint buffers, is also called a buffered crossbar; the other three, which lack crosspoint buffers, are called unbuffered crossbars.

OQ switches, which buffer packets at the output ports, are powerful in terms of providing QoS, but they remain a theoretical model: they require memory N times faster than the link rate, where N is the number of switch ports, which is impractical. In an IQ switch, packets are queued at the input ports, and the memory operates at the external link rate. To overcome head-of-line (HOL) blocking and achieve higher throughput, the input buffers of an IQ switch are organized as virtual output queues (VOQs). It is well known, however, that IQ switches cannot ensure quality of service (QoS) as quantified by guaranteed rate, delay, and so on. CIOQ switches buffer packets at both the input and output ports and require memory S times faster than the link rate, where S is the speedup, 1 < S < N.

Because of output contention, the performance of an IQ or CIOQ cell switch depends on how cells are scheduled through the switch, i.e., on how a set of contention-free cells is determined for each time slot. By modeling the I/O connection requests as a bipartite graph, the scheduling problem is reduced to a matching optimization problem, covering both maximum matching and maximal matching.
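The reduction to bipartite matching can be illustrated with a minimal greedy sketch over a request matrix (an illustration only; the names and the greedy policy are our own, not the dissertation's):

```python
# R[i][j] = True means input i has a cell (a non-empty VOQ) for output j.
# Each input and each output is matched at most once. The result is maximal
# (no further request can be added) but not necessarily maximum.
def greedy_maximal_match(R):
    n = len(R)
    matched_out = set()
    match = {}                       # input index -> output index
    for i in range(n):
        for j in range(n):
            if R[i][j] and j not in matched_out:
                match[i] = j         # grant output j to input i
                matched_out.add(j)
                break
    return match

R = [[True, True, False],
     [True, False, False],
     [False, False, True]]
print(greedy_maximal_match(R))  # {0: 0, 2: 2}
```

Note that this greedy result has size 2, while a maximum matching on the same requests (e.g. input 0 to output 1, input 1 to output 0, input 2 to output 2) has size 3, which is exactly the throughput gap between maximal and maximum matching discussed below.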
Even though maximum matching algorithms can provide high throughput and low delay, the time complexity of finding a maximum matching is too high for practical use. Maximal matching algorithms can be implemented sequentially or in parallel. Sequential matching algorithms allocate the unmatched output ports by traversing the request matrix diagonal by diagonal. Parallel matching algorithms use a request-grant-accept or request-accept information exchange cycle to match inputs to outputs. Compared with maximum matching, maximal matching algorithms are more practical. For a switch with N input ports and N output ports, however, these schedulers require 2N arbiters working through O(log N) Request-Grant-Accept (RGA) iterations to obtain a maximal matching. Even implemented in hardware, such schedulers are considered too slow and too costly for high-speed networks.

A recent trend in improving the crossbar switch is to reduce the cost and latency of cell scheduling by introducing buffers at the crosspoints. Such a scheduler operates in two phases. In the first phase, each input port selects a cell to place into a crosspoint buffer in its row; in the second phase, each output port selects a crosspoint in its column from which to take a cell. Input (resp. output) ports operate independently and in parallel in the first (resp. second) phase, eliminating the single centralized scheduler. The major disadvantages of crossbars with crosspoint buffers are the chip area consumed by O(N²) on-chip buffers and the O(N²) scheduling hardware cost.

Is it possible to build an unbuffered crossbar switch with simple, fully distributed (and parallel) schedulers and without O(N²) control circuit complexity? We answer this question affirmatively by proposing an innovative agile crossbar switch architecture called the contention-tolerant crossbar, denoted CTC(N).
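The two-phase operation of a crosspoint-buffered (CICQ) crossbar described above can be sketched as a toy slot simulation (all names and the first-fit selection policy are illustrative assumptions, not the specific algorithms studied here):

```python
# Phase 1: each input independently moves one cell from a non-empty VOQ
# into a free crosspoint buffer on its row. Phase 2: each output
# independently drains one occupied crosspoint in its column. No
# centralized scheduler coordinates the two phases.
N = 3
xpoint = [[None] * N for _ in range(N)]                 # N x N crosspoint buffers
voq = [[["c"], [], ["c"]], [[], ["c"], []], [["c"], [], []]]

# Phase 1 (first-fit here; a real scheduler might use round-robin).
for i in range(N):
    for j in range(N):
        if voq[i][j] and xpoint[i][j] is None:
            xpoint[i][j] = voq[i][j].pop()
            break

# Phase 2: each output j takes a cell from one occupied crosspoint.
delivered = []
for j in range(N):
    for i in range(N):
        if xpoint[i][j] is not None:
            delivered.append((i, j))                    # cell crosses (i, j)
            xpoint[i][j] = None
            break
print(delivered)  # [(0, 0), (1, 1)]
```

Both loops run over all inputs (resp. outputs) independently, which is what allows a hardware realization to execute them in parallel; the price is the O(N²) crosspoint buffer array itself.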
Unlike the conventional crossbar and the crossbar with crosspoint buffers, which require complex hardware arbiters to grant one out of multiple output requests, CTC(N) tolerates output contention through a pipelining mechanism whose pipeline stages are implemented as buffers in the input ports. These buffers decouple the scheduling task into N independent parts, so that N schedulers can be located in the N input ports and operate independently and in parallel. Without arbiters or crosspoint buffers that consume additional chip area, the CTC(N) switch is more scalable.

As a first step, we investigate CTC(N) with FIFO queues in its input ports. By the properties of CTC(N), the HOL cell of each input queue is transmitted without delay, so HOL blocking does not exist in CTC(N). To evaluate switch throughput, we develop a tagged output queueing network model for CTC(N) using queueing theory and show that, under Bernoulli i.i.d. uniform traffic, CTC(N) without internal speedup has a worst-case throughput of 63%. Simulation results validate our theoretical results. We consider three approaches to improving the throughput: (1) introduce internal speedup; (2) decrease the external arrival rate at each queue; (3) decrease the arrival rate from upstream.

Following the first approach, we prove that CTC(N) with a single FIFO queue in each input port and an internal speedup of 2 achieves 100% throughput under Bernoulli i.i.d. uniform traffic; simulation results validate our analysis.

Following the second approach, we propose Multi-layer CTC(N), or MCTC(N) for short, which applies the parallel switching technique. MCTC(N) consists of k identical CTC(N) layers, each with its own inputs and outputs and separate buffers. Each input port of the switch is connected to the corresponding input of every layer via an input switching element (ISE). Using these ISEs, cells arriving from outside are evenly distributed over all layers.
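One simple way to realize the even ISE distribution just described is round-robin dispatch, sketched below (round-robin is our illustrative assumption; the dissertation only requires that the load be spread evenly, so each layer sees 1/k of the offered load):

```python
# A hypothetical ISE at one switch input spraying arriving cells over
# k = 2 CTC(N) layers in round-robin order.
k = 2
layer_queues = [[] for _ in range(k)]       # per-layer input queues
for seq, cell in enumerate(["c0", "c1", "c2", "c3"]):
    layer_queues[seq % k].append(cell)      # ISE: round-robin dispatch
print(layer_queues)  # [['c0', 'c2'], ['c1', 'c3']]
```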
Once a cell is injected into a component CTC(N), it remains in that layer until it reaches its destination output within the layer. The outputs of all CTC(N) layers are connected to the switch output ports by output switching elements (OSEs). With load balancing and parallel switching, the offered load at the inputs of each layer is reduced, which leads to higher throughput. Using the tagged output queueing network model, we prove that, under Bernoulli i.i.d. uniform traffic, the smallest number of layers with which MCTC(N), with a single FIFO queue in each input, achieves 100% switch throughput is 2. We also develop a queueing network model for one layer of a k-layer MCTC(N); by solving the corresponding discrete-time Markov chain, we analyze the buffer size and mean cell delay of k-layer MCTC(N) under the Bernoulli i.i.d. uniform traffic assumption. Our theoretical model and analysis are confirmed by simulation results, and this queueing network model can be further explored to analyze output queueing in MCTC(N).

Owing to the distributed control of CTC(N), the schedulers in different input ports operate independently. The scheduling algorithms in different input ports may differ; such a set of algorithms is collectively called non-uniform. If the algorithms in all input ports are the same, the set is called uniform. We explore approach (3) by designing a uniform, fully distributed scheduling scheme called the Staggered Scheduling Algorithm Scheme. The scheduler in input port i, denoted Si, consists of two sub-schedulers: the primary sub-scheduler PSi and the secondary sub-scheduler SSi. In each time slot: (1) PSi generates a unique number, which is a VOQ index; if this VOQ is empty, PSi returns false, otherwise it returns the index of this VOQ as its scheduling result. (2) SSi maintains a list of non-empty VOQ indices and, according to some algorithm, chooses a VOQ index from this list as its scheduling result.
If the list is empty, SSi returns false. The two sub-schedulers operate in parallel. The scheduler Si takes the result of PSi first; if PSi returns false, Si falls back to the result of SSi; if both return false, no VOQ is serviced in the current time slot. The primary scheduling processes across all input ports aim to find a set of cells with the minimum number of output contentions, while the secondary scheduling process tries to maximize the utilization of each input port. Different scheduling algorithms can be designed for both sub-schedulers to suit different traffic. With zero knowledge of the other input ports, the fully distributed schedulers "smartly" cooperate with one another, so that improved performance is attained. Using three candidate algorithms, we show by simulation that the Staggered Scheduling Algorithm Scheme achieves high performance.

This investigation of CTC(N) opens new opportunities for constructing practical packet switches for IP routers, and many aspects of CTC(N) deserve further investigation. For example, as preliminary research, our analysis assumes Bernoulli i.i.d. uniform arrival traffic: what is the minimum speedup needed to achieve 100% throughput under other traffic patterns? Our architecture may switch cells through the switch out of their original order: how can the cell out-of-order problem be minimized? Moreover, 100% throughput is a necessary condition for output queueing, and the fully distributed control of CTC(N) offers a wide range of opportunities for designing QoS-assuring cell scheduling algorithms.
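The per-port scheduler Si of the Staggered Scheduling Algorithm Scheme described above might be sketched as follows. The concrete choices here are our own assumptions for illustration: PSi derives its unique VOQ index as (i + t) mod N, one staggering that keeps all primaries collision-free in a slot, and SSi picks a random non-empty VOQ; the dissertation leaves both algorithms open.

```python
import random
from collections import deque

def staggered_pick(i, t, voq, N, rng):
    """Scheduler S_i at input port i in time slot t (illustrative sketch).

    PS_i: the primary index (i + t) % N is distinct across ports i in any
    slot t, so no two primary choices ever contend for the same output.
    SS_i: fallback to some non-empty VOQ (random here) to keep the port busy.
    Returns the chosen VOQ index, or None if every VOQ is empty.
    """
    primary = (i + t) % N
    if voq[primary]:
        return primary                                  # PS_i succeeds
    backlog = [j for j in range(N) if voq[j]]
    return rng.choice(backlog) if backlog else None     # SS_i, or idle

rng = random.Random(0)
N = 4
voq = [deque() for _ in range(N)]
voq[2].append("cell")
# Slot t = 1, port i = 1: primary index (1 + 1) % 4 = 2 is non-empty.
print(staggered_pick(1, 1, voq, N, rng))  # 2
```

When the primary VOQ is empty, the port still transmits via SSi whenever any VOQ is backlogged, which is exactly the utilization role the secondary sub-scheduler plays in the scheme.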
Keywords/Search Tags: Contention-tolerant, crossbar, switch, parallel switching, scheduling algorithm