Key Research Issues Of Memory Architecture For Three Dimensional Multi-Core Processors

Posted on:2016-08-03

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y A Zhang

Full Text:PDF

GTID:1228330461956561

Subject:Microelectronics and Solid State Electronics

Abstract/Summary:


As more and more transistors are integrated onto a single chip, two development trends stand out in digital integrated circuits (ICs): the move from single-core processors that exploit instruction-level parallelism to multi-core processors that exploit thread-level and data-level parallelism, and the move from two-dimensional (2D) ICs to three-dimensional (3D) ICs with multiple device layers. The memory subsystem is one of the most important components of a multi-core processor. On-chip memories not only take up a large share of the chip area and power budget, but also have a major effect on the performance of multi-core processors. As the number of cores integrated on a single chip grows and the processing power of each core improves, the parallel cores demand more and more data. The well-known "memory wall" problem of single-core processors persists in multi-core processors and may become even more severe. When multi-core processors are combined with 3D integration, caches and main memories can be stacked on top of the processor layers. The high-density inter-layer interconnect can widen memory bandwidth, reduce memory access latency, and hence improve the performance of multi-core processors. 3D integration technology is therefore very promising for mitigating the "memory wall" problem of multi-core processors.

This dissertation studies the key issues of memory subsystems for two representative kinds of 3D multi-core processors: general-purpose chip multi-processors (CMPs) and many-core general-purpose graphics processing units (GPGPUs). It tries to identify the system bottlenecks and proposes new designs to improve the overall performance of 3D multi-core systems. The dissertation contains the following main parts.

In this dissertation, we study the performance improvement obtained by stacking multiple layers of last-level caches (LLCs) and DRAM main memories for 3D CMPs.
For mesh-based 3D multi-core networks-on-chip, a tightly mixed non-uniform cache architecture (TM-NUCA) is proposed. Compared to the baseline 3D CMP, the 3D CMP with TM-NUCA improves performance by up to 31.71% and reduces network power consumption by up to 15.74%.

The non-uniform memory access (NUMA) architecture is now feasible for 3D CMPs with stacked main memories. We present a scalable NUMA architecture for 3D CMPs. The on-chip main memories are partitioned into private memory, shared memory and other special-purpose memories, and are distributed among the processor nodes. The access time of private memories does not change as the number of cores grows, and the delay to a shared memory depends on the distance between the core and that shared memory. To support parallel access to the shared memories, synchronization and memory consistency schemes are discussed. The experimental results show that the proposed distributed NUMA architecture efficiently supports parallel memory access and hence provides good speedups for 3D CMPs.

Caches exploit the spatial and temporal locality of data, which reduces accesses to slower memories and hence decreases the average memory access latency. However, private caches may cause cache incoherence in multi-core processors. Cache coherence is one of the most important topics for CMPs. In this dissertation, we implement a software-hardware hybrid cache coherence scheme based on microcode. We then propose a cluster-based hierarchical cache coherence scheme for large-scale 3D CMPs. The analysis indicates that, compared to flat directory-based cache coherence, the proposed cluster-based hierarchical cache coherence has lower protocol communication cost and a smaller directory size.

The GPGPU is a promising kind of many-core processor, both now and in the future. This dissertation quantitatively analyzes the effect of memory access latency on the performance of GPGPUs.
A 3D GPGPU with stacked DRAM main memories is proposed. The experimental results show that, compared to the baseline 2D GPGPU, the 3D-stacked GPGPU provides up to 124.1% and on average 55.8% performance improvement. The memory subsystem of the 3D GPGPU is more power efficient than that of the 2D GPGPU. The temperature of the proposed 3D GPGPU for the test cases ranges between 60 and 85 degrees Celsius, which is acceptable for 3D ICs. To the best of our knowledge, this work is the first to study and assess the impact of stacking main memory atop GPGPUs on performance, power and temperature.

GPGPU applications have diverse requirements for local memories, e.g. registers, shared memory and L1 cache. In this dissertation, a reconfigurable local memory (RLM) architecture is proposed for 3D GPGPUs. The reconfigurable memory can be configured dynamically as registers, shared memory or L1 data cache at per-kernel granularity. The experimental results show that the proposed RLM efficiently utilizes the extended local memory. Compared to a baseline 3D GPGPU, the 3D RLM-GPGPU provides up to 52.22% and on average 15.87% performance improvement.

As the local memory of a 3D GPGPU grows, some applications may have imbalanced workloads across the streaming multiprocessors (SMs). To handle this imbalanced dispatching issue, a greedy-lazy hybrid cooperative thread array (CTA) dispatching scheme is proposed. The scheme has two phases: a greedy phase and a lazy phase. The greedy phase is used to fully utilize the local memories. When an imbalanced workload assignment may occur, the scheme switches to lazy mode to mitigate the potential imbalance. The experimental results show that the proposed greedy-lazy hybrid CTA dispatching scheme efficiently facilitates workload balancing and hence improves the overall performance of 3D GPGPUs.
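
The two-phase dispatching idea can be illustrated with a toy model. The sketch below is not the dissertation's algorithm: the function names, the per-SM capacity parameter `cap`, and the switching rule (go lazy when fewer CTAs remain than would fill every SM) are all assumptions made for illustration. It only shows why purely greedy filling can leave some SMs idle once enlarged local memory raises the number of CTAs an SM can hold, and how a one-at-a-time lazy phase evens the load out.

```python
def greedy_dispatch(num_ctas, num_sms, cap):
    """Hypothetical greedy policy: fill each SM to capacity before moving on."""
    counts = [0] * num_sms
    remaining = num_ctas
    for sm in range(num_sms):
        take = min(cap, remaining)
        counts[sm] = take
        remaining -= take
    return counts


def greedy_lazy_dispatch(num_ctas, num_sms, cap):
    """Hypothetical hybrid policy for the first wave of CTAs.

    Assumed switching rule: stay greedy while enough CTAs remain to fill
    every SM; otherwise hand out CTAs one per SM in round-robin order.
    """
    if num_ctas >= num_sms * cap:
        # Greedy phase: every SM can be filled, so local memory is fully used.
        return [cap] * num_sms

    # Lazy phase: fewer CTAs than total capacity, so spread them evenly.
    # No SM exceeds cap here because num_ctas < num_sms * cap.
    counts = [0] * num_sms
    for cta in range(num_ctas):
        counts[cta % num_sms] += 1
    return counts


if __name__ == "__main__":
    # 10 CTAs, 4 SMs, each SM can hold up to 8 CTAs (illustrative numbers).
    print(greedy_dispatch(10, 4, 8))
    print(greedy_lazy_dispatch(10, 4, 8))
```

With these illustrative numbers the greedy policy packs most CTAs onto the first SM and leaves two SMs empty, while the hybrid policy gives every SM two or three CTAs. A real dispatcher would also have to react as CTAs retire at run time, which this static model ignores.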

Keywords/Search Tags:

Three Dimensional Integrated Circuit (3D IC), Chip Multi-Processor (CMP), 3D stacking cache, Non-Uniform Cache Architecture (NUCA), 3D stacking main memory, Non-Uniform Memory Access (NUMA), cache coherence, Network on Chip (NoC)


Related items

1	Optimizations Of Memory Subsystem For Chip Multiprocessor Systems
2	Research On Shared Cache Access Fairness For Many-Core Processor
3	Cache Coherence Techniques For Chip Multiprocessor Architecture
4	Smart Directory Cache For Multi-Many-Core Systems
5	Research On Key Technologies Of CC-NUMA Based Memory Architecture
6	Key Techniques Research Of Memory In Homogeneous General Purpose Stream Processor
7	The NUMA page migration/page replication ASIC {NPMR}: A chip design to improve memory system performance in a Non-Uniform Memory Access (NUMA) multiprocessor system architecture
8	Research And Implementation Of The Cache Coherence Protocol For The Large Scale System Of The SMP-based CC-NUMA Category
9	Research On Cache Coherence Protocols Based On Data Sharing Characteristics
10	Analysis And Implementation Of Cache Coherence Protocols For CMP