Improving Scalability of Chip-MultiProcessors with Many HW ACCelerators

Posted on:2018-02-26

Degree:Ph.D

Type:Dissertation

University:Northeastern University

Candidate:Teimouri, Nasibeh

GTID:1448390002498044

Subject:Computer Engineering

Abstract/Summary:

Breakthrough streaming applications such as virtual reality, augmented reality, autonomous vehicles, and multimedia demand high-performance and power-efficient computing. In response to this ever-increasing demand, manufacturers look beyond the parallelism available in Chip Multi-Processors (CMPs) and more toward application-specific designs. In this regard, ACCelerator (ACC)-based heterogeneous CMPs (ACMPs) have emerged as a promising platform.

An ACMP combines application-specific HW ACCelerators (ACCs) with General Purpose Processor(s) (GPPs) on a single chip. ACCs are customized to provide high-performance and power-efficient computing for specific compute-intensive functions, while the GPP(s) run the remaining functions and control the whole system. In ACMP platforms, ACCs achieve performance and power benefits at the expense of reduced flexibility and generality for running different workloads. Therefore, manufacturers must utilize several ACCs to target a diverse set of workloads within a given application domain.

However, our observations show that conventional ACMP architectures with many ACCs have scalability limitations. The ACCs' benefits in processing power can be overshadowed by bottlenecks on the shared resources: processor core(s), communication fabric/DMA, and on-chip memory. These resource bottlenecks stem primarily from the ACCs' data access and orchestration load. Because the semantics for communication with ACCs are very loosely defined and the designs rely on general platform architectures, the resource bottlenecks hamper performance.

This dissertation explores and alleviates the scalability limitations of ACMPs. To this end, the dissertation first proposes an analytical model to holistically explore how bottlenecks emerge on shared resources as the number of ACCs increases. Afterward, it proposes ACMPerf, an analytical model that captures the impact of the resource bottlenecks on the achievable ACC benefits.

Then, to open a path toward more scalable integration of ACCs, the dissertation identifies and formalizes ACC communication semantics. The semantics describe four primary aspects: data access, synchronization, data granularity, and data marshalling.

Considering the identified ACC communication semantics, and improving upon conventional ACMP architectures, the dissertation proposes a novel architecture of Transparent Self-Synchronizing ACCs (TSS). TSS efficiently realizes the communication semantics of direct ACC-to-ACC connections, which often occur in streaming applications. TSS adds autonomy to ACCs so that they locally handle the semantic aspects of data granularity, data marshalling, and synchronization. It also exploits a local interconnect among ACCs to address the semantic aspect of data access. Because TSS gives ACCs the autonomy to self-synchronize and self-orchestrate each other independently of the processor, it enables the finest data granularity, which reduces the pressure on the shared memory. TSS also exploits a local, reconfigurable interconnect for direct data transfer among ACCs without occupying the DMA and communication fabric.

As a result of reducing the overhead of direct ACC-to-ACC connections, TSS delivers more of the ACCs' benefits than conventional ACMP architectures: up to 130x higher throughput and 209x lower energy, both resulting from an up to 78x reduction in the load imposed on the shared resources.
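
The scalability argument in the abstract (added ACCs stop helping once shared resources saturate, and cutting the load on those resources restores the benefit) can be illustrated with a toy bound. This is only a sketch under stated assumptions, not the dissertation's ACMPerf model: the function name, the parameters, and all numbers below are hypothetical and chosen purely for illustration.

```python
def achievable_throughput(num_accs, acc_rate, shared_capacity, load_per_job):
    """Toy bound on jobs per second for a chip with `num_accs` accelerators.

    All parameters are hypothetical illustration values:
      acc_rate        -- jobs/s one ACC could process in isolation
      shared_capacity -- load units/s the shared resources can serve
      load_per_job    -- load units each job imposes on shared resources
    """
    compute_bound = num_accs * acc_rate            # limit set by the ACCs
    shared_bound = shared_capacity / load_per_job  # limit set by shared resources
    return min(compute_bound, shared_bound)


if __name__ == "__main__":
    # Throughput flattens once the shared-resource bound is reached.
    for n in (1, 2, 4, 8):
        print(n, achievable_throughput(n, 10.0, 100.0, 4.0))
    # Lowering the per-job load on shared resources lifts the ceiling.
    print(achievable_throughput(8, 10.0, 100.0, 1.0))
```

In this simplified picture, any architecture that lowers `load_per_job` raises the ceiling, so more of the ACCs' compute capability becomes usable. The sketch does not model energy, DMA, or the four communication-semantics aspects.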

Keywords/Search Tags:

Conventional ACMP architectures, ACC, Accs, Shared resources, TSS, Scalability, Processor, Data

Related items

1	The Design Of Distributed Embedded ACCS HIL Simulation Platform And CAN-Bus Communication Research
2	Efficient Use of Execution Resources in Multicore Processor Architectures
3	Designing graphics architectures around scalability and communication
4	Research On Key Technology Of High-efficient Shared Memory System In Network Processor Based On MPSoC
5	Modeling interprocess shared-cache contention on multicore architectures with applications in virtual machine CPU scheduling
6	Research And Optimization Of Scalability In Distributed Shared Memory
7	Study And Implementation Of High Performance Parallel Hierarchy Stream Memory System
8	Performance portability and scalability in shared-address-space multiprocessing
9	Research On Parallel Performance Of Operating System In Multicore Environment
10	Modeling Of Shared Cache Memory Access Behavior Based On Artificial Neural Network