
Research On Efficient GPGPU Networks-on-Chip For Coprocessors

Posted on: 2018-08-17    Degree: Master    Type: Thesis
Country: China    Candidate: W J Liu    Full Text: PDF
GTID: 2428330569499056    Subject: Computer Science and Technology
Abstract/Summary:
GPGPUs are now widely used in high-performance computing and scientific computing because of their excellent large-scale parallel computing power. As the bandwidth of integrated DRAM and the number of computing units on a GPGPU increase, GPGPUs place ever higher performance demands on the Network-on-Chip (NoC). Current NoC research, however, cannot meet the communication requirements of GPGPUs and other GPGPU-like coprocessors. On the one hand, a GPGPU is bandwidth-sensitive rather than latency-sensitive, unlike a general-purpose processor. This property makes it ineffective to improve system-level GPGPU performance merely by reducing NoC delay; optimizing memory access performance is more critical. On the other hand, most current NoC designs focus on network-level performance improvement but disregard system-level performance. The network-level metrics they value reflect the characteristics of the network itself but are not directly related to overall system performance. It therefore makes more sense to study the NoC from a system-level perspective. Motivated by the urgent need for research on GPGPU NoCs, we first explore the communication mode of GPGPUs and the bottleneck problems faced by GPGPU NoC research. Based on these findings, we then propose several design schemes, related respectively to memory access scheduling, the NoC arbitration mechanism, virtual channel partitioning, and router microarchitecture. Finally, we propose an efficient GPGPU NoC framework based on the design schemes above. Specifically, the main research work and corresponding designs of this paper are as follows:

(1) Proposing a cost-efficient memory access scheduling scheme. Surveying current memory controller (MC) designs, we find that most employ out-of-order scheduling to maximize row access locality and, in turn, memory access efficiency. However, the high performance of out-of-order scheduling comes at the expense of high area and power overhead. This overhead is negligible in general-purpose processor design but becomes a more severe problem as the number of computing units integrated in GPGPUs grows. Our solution has two stages. In the first stage, we design a same-source-first (SSF) arbitration to improve the transmission mechanism of the NoC, which strongly influences the row access locality of memory requests. In the second stage, we propose a simple, low-overhead Batched-FIFO memory access scheduling scheme. This alternative, based on in-order scheduling, is more effective and scalable in a GPGPU NoC.

(2) Proposing a terminal-node-based static virtual channel partitioning. Virtual channels are widely used in current on-chip networks because they improve buffer utilization and avoid network deadlocks. Analyzing the basic principle of most virtual channel partitioning schemes, we find that they always allocate the first empty virtual channel queue to a request. This mechanism leads to datapath diversity, which reduces the row access locality of memory requests. We therefore introduce the terminal node information of memory requests into virtual channel partitioning to avoid the disturbance of datapath diversity. This novel partitioning mechanism preserves DRAM row access locality while still exploiting the advantages of virtual channels, ultimately improving system performance.

(3) Proposing a multi-port router microarchitecture oriented to memory nodes. The on-chip network of a GPGPU is divided into request and reply networks. The reply network usually carries a large number of read reply packets, which require much more network bandwidth because they carry the massive data requested by compute nodes. As a result, the heavy reply traffic from memory nodes to compute cores causes a network bottleneck and further degrades overall performance. To solve this problem, we propose a router microarchitecture, connected only to memory nodes, that increases the number of local injection/ejection ports. This design alleviates the heavy burden on memory nodes with little change to the system structure, making it compatible with the other designs discussed above.

(4) Integrating an efficient design framework for GPGPU NoCs. The overall objective of this paper is the efficient design of on-chip networks in GPGPUs, so the design schemes proposed above target low power consumption and high performance respectively and are fully compatible with one another. After achieving the research goals detailed above, we integrate them into an overall design framework for an efficient GPGPU NoC architecture. Experimental results on memory-intensive applications show that, compared to FR-FCFS, the overall system performance of the proposed GPGPU NoC increases by 10.5%, power consumption falls by 20%, and the energy efficiency ratio grows by 36.9%. The proposed GPGPU NoC architecture thus achieves a high energy efficiency ratio on memory-intensive applications while holding the performance and power consumption of non-memory-intensive applications. Furthermore, our design scheme has much lower system design and implementation complexity than on-chip networks based on FR-FCFS.
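The intuition behind stage (1) can be sketched in a few lines: if the NoC arbiter keeps granting the source it granted last, same-row requests from one core reach the memory controller back-to-back, so even a cheap in-order scheduler sees most of the row-buffer hits that an out-of-order FR-FCFS scheduler would reorder to find. The following toy model is an illustrative assumption, not the thesis's actual implementation: the names (`SSFArbiter`, `ssf_schedule`, `inorder_row_hits`) are invented here, and DRAM is reduced to a single open row per bank.

```python
from collections import deque

class SSFArbiter:
    """Same-source-first (SSF) arbitration sketch: among competing
    head-of-queue requests, prefer the source granted last cycle, so
    packets from one core leave back-to-back and same-row requests
    arrive at the memory controller adjacent to each other."""

    def __init__(self):
        self.last_src = None

    def grant(self, heads):
        # heads: list of (src, dram_row) head-of-queue requests
        for req in heads:
            if req[0] == self.last_src:
                return req          # keep granting the same source
        self.last_src = heads[0][0]  # else fall back to the oldest head
        return heads[0]

def ssf_schedule(ports):
    """Drain per-source input queues through the SSF arbiter and
    return the order in which requests reach the memory controller."""
    arb, out = SSFArbiter(), []
    while any(ports.values()):
        heads = [q[0] for q in ports.values() if q]
        granted = arb.grant(heads)
        ports[granted[0]].popleft()
        out.append(granted)
    return out

def inorder_row_hits(stream):
    """Row-buffer hits seen by a simple in-order (FIFO) memory
    scheduler: a request hits iff it targets the currently open row."""
    hits, open_row = 0, None
    for _, row in stream:
        hits += (row == open_row)
        open_row = row
    return hits
```

With two cores issuing interleaved requests to different rows, a plain FIFO sees zero row hits, while the SSF-ordered stream recovers most of the locality without any out-of-order logic in the controller.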
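The contrast drawn in (2) between first-free virtual channel allocation and a terminal-node-based static partition can also be made concrete. The sketch below is a plausible reading of the abstract, not the thesis's design: the modulo mapping from destination node to VC is an assumption introduced here for illustration, as is the class name `VCAllocator`.

```python
class VCAllocator:
    """Virtual-channel allocation sketch contrasting the common
    first-free policy with a terminal-node-based static partition."""

    def __init__(self, num_vcs):
        self.num_vcs = num_vcs

    def first_free_vc(self, free_mask):
        # Baseline: grant the first empty VC regardless of destination.
        # Packets bound for the same memory node may land on different
        # VCs, take diverse datapaths, and reorder en route, which
        # destroys the row access locality of the request stream.
        for vc, free in enumerate(free_mask):
            if free:
                return vc
        return None  # no VC available this cycle

    def terminal_node_vc(self, dest_node):
        # Static partition keyed on the terminal (memory) node: every
        # packet headed to one memory controller uses the same VC, so
        # relative order, and thus DRAM row locality, is preserved.
        return dest_node % self.num_vcs
```

The key property is determinism: two requests to the same memory node always get the same VC, whereas the first-free policy's answer depends on transient buffer occupancy.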
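The bottleneck motivating (3) admits a back-of-envelope model. Under the simplifying assumptions that each read reply occupies several flits and a router moves one flit per local port per cycle (assumptions made here, not stated in the abstract), the number of local injection/ejection ports at a memory node directly bounds how fast replies drain into the reply network:

```python
def reply_drain_cycles(num_replies, flits_per_reply, local_ports):
    """Toy model of the memory-node reply bottleneck: read replies
    are long because they carry data flits, and each local port moves
    one flit per cycle, so drain time scales inversely with the
    number of local injection/ejection ports."""
    total_flits = num_replies * flits_per_reply
    return -(-total_flits // local_ports)  # ceiling division
```

Doubling the local ports at memory-node routers halves the drain time in this model, which is the effect the multi-port microarchitecture targets, while leaving compute-node routers, and hence the rest of the system structure, untouched.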
Keywords/Search Tags: GPGPU, Network-on-Chip, Memory Access Scheduling, Virtual Channel Partitioning, Multi-port Router