
Research On Key Technologies For Cache Power And Performance Optimization On Many-core Heterogeneous Architecture

Posted on: 2015-08-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Zheng
Full Text: PDF
GTID: 1108330509461017
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneity and many-core are the trends in current processor design. Heterogeneity means that cores with different ISAs are integrated into a single chip; many-core refers to integrating a very large number of cores, up to thousands, on a single chip. Heterogeneous many-core systems have gained popularity in high-performance computing owing to their high performance and energy efficiency. However, the power wall and the memory wall are impeding further performance improvement of such processors. As manufacturing technology improves, more transistors can be placed on one chip, but some of them must be powered off because of power-consumption and heat-dissipation constraints. In addition, the speed of main memory still lags behind core speed and becomes the performance bottleneck.

Caches have long alleviated the memory-wall problem by exploiting data locality. However, caches consume a large portion of the chip area and up to 50% of the chip power. Both cache power and cache performance must therefore be optimized, to lower chip power and improve performance; such optimization is especially meaningful for heterogeneous many-core systems.

This thesis deals with cache power and the utilization of limited cache resources. We focus on two problems: (1) energy waste in the level-one cache caused by parallel way access, and (2) cache allocation when a cache is shared by thousands of threads. The main contributions are as follows (an illustrative sketch of each mechanism appears after the list):

1. We propose region-based way partitioning of the level-one cache for low power. Exploiting the distinct access patterns of the stack and non-stack regions, accesses to the two regions are isolated and directed to different ways of the cache. Each access therefore probes fewer cache ways than usual, reducing access energy. The partition can also be reconfigured dynamically to adapt to different programs, avoiding the potential performance degradation of a static partition. Experiments on a 4-way set-associative cache show that our method saves about 28% of the cache power consumption.

2. We propose tag-check elision to save cache access power. Addressing the same parallel-access problem, we access a previously recorded cache line directly, skipping the tag check and the parallel way probe. For a cache line, we record the base register used to access it and the corresponding offset bounds. When memory is next accessed through the same base register, a bounds check determines whether the address falls within the recorded cache line; if it does, the data is read directly from that line without a tag check or TLB access, saving further power. Experimental results show that this approach saves 30% and 67% of the cache and DTLB dynamic energy, respectively.

3. We propose dynamic cache allocation for many-core architectures. In many-core architectures such as GPGPUs, the disparity between cache capacity and thread count leads to low cache utilization. To improve utilization, we propose a stochastic cache-allocation scheme. For each cache line, the requesting PC and reuse information are recorded, and instructions are granted cache lines according to that reuse information: low cache-line reuse means a low probability of allocation. This decision improves cache utilization and avoids excessive memory accesses, improving performance. A speedup of up to 2.5X is achieved under this allocation mechanism.
4. We propose concurrency allocation based on cache performance. The caches and memory bandwidth of a GPGPU cannot sustain thousands of threads running concurrently. We therefore propose joint cache and concurrency allocation based on access patterns. Instructions are categorized by access pattern, and their locality is analyzed using the access pattern and the cache miss rate. The warps that may use the level-one cache, and the warps that may run at all, are then determined from instruction locality and cache capacity. As a result, memory bandwidth and computational resources are better utilized, boosting performance. Experimental results show that this approach improves performance by up to 3X, and by 54% on average.
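To make contribution 1 concrete, here is a minimal C++ sketch of a lookup that probes only the ways assigned to the access's region. The 4-way geometry, the 2/2 way split, the 64-byte line size, and all identifiers are assumptions for illustration, not the thesis's actual hardware design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical 4-way L1 model: a configurable mask assigns ways to the
// stack region and the remaining ways to everything else, so the split
// can be retuned per program as the dissertation describes.
struct CacheLine { bool valid = false; uint64_t tag = 0; };

struct WayPartitionedL1 {
    static constexpr int kWays = 4;
    static constexpr int kSets = 64;
    std::array<std::array<CacheLine, kWays>, kSets> sets{};
    uint8_t stack_way_mask    = 0b0011;  // ways reserved for stack accesses
    uint8_t nonstack_way_mask = 0b1100;  // ways reserved for non-stack accesses

    // Only the ways belonging to the access's region are probed, so a
    // lookup activates 2 way comparators instead of 4, saving energy.
    std::optional<int> lookup(uint64_t addr, bool is_stack) {
        uint64_t set = (addr >> 6) % kSets;  // 64-byte lines assumed
        uint64_t tag = addr >> 12;           // bits above set index + offset
        uint8_t mask = is_stack ? stack_way_mask : nonstack_way_mask;
        for (int w = 0; w < kWays; ++w) {
            if (!(mask & (1u << w))) continue;  // skip ways outside region
            const CacheLine& ln = sets[set][w];
            if (ln.valid && ln.tag == tag) return w;
        }
        return std::nullopt;  // miss within the region's ways
    }
};
```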
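Contribution 2 can be sketched as a small table indexed by base register, holding the bounds of the last line touched through that register. The record layout, the 64-byte line size, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <optional>

// Per-base-register record of the last cache line accessed through that
// register: the line's address bounds plus its location in the data array.
struct LineRecord {
    bool     valid = false;
    uint64_t line_lo = 0;  // first byte address covered by the line
    uint64_t line_hi = 0;  // one past the last covered byte
    int      way = -1;     // where the line lives in the data array
};

constexpr int kNumBaseRegs = 32;
LineRecord records[kNumBaseRegs];

// Returns the way to read directly, with the tag check and TLB lookup
// elided, when the new address provably lies in the recorded line;
// otherwise the caller falls back to a normal cache access.
std::optional<int> try_elide(int base_reg, int64_t offset, uint64_t base_val) {
    uint64_t addr = base_val + offset;
    const LineRecord& r = records[base_reg];
    if (r.valid && addr >= r.line_lo && addr < r.line_hi)
        return r.way;     // bounds check passed: skip tag check and TLB
    return std::nullopt;  // outside the recorded line: full access needed
}

// Refresh the record after a normal access resolves the line's location.
void record_access(int base_reg, uint64_t addr, int way) {
    uint64_t lo = addr & ~63ull;  // align down to the 64-byte line base
    records[base_reg] = { true, lo, lo + 64, way };
}
```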
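The stochastic allocation of contribution 3 can be approximated with a per-PC saturating reuse counter that drives the allocation probability. The counter width, the table layout, and the probability mapping below are guesses for illustration only.

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>

// Reuse-driven allocation: each memory-instruction PC accumulates a small
// reuse score; on a miss, a cache line is granted with probability
// proportional to that score, so rarely reused instructions mostly bypass
// the cache instead of polluting it.
struct ReuseTable {
    std::unordered_map<uint64_t, uint8_t> score;  // PC -> saturating counter
    std::mt19937 rng{42};

    void on_hit(uint64_t pc) {                 // line was reused: reward PC
        uint8_t& s = score[pc];
        if (s < 15) ++s;
    }
    void on_evict_unused(uint64_t pc) {        // evicted untouched: penalize
        uint8_t& s = score[pc];
        if (s > 0) --s;
    }

    // Decide whether a missing line requested by `pc` gets a cache line.
    bool allocate(uint64_t pc) {
        uint8_t s = score.count(pc) ? score[pc] : 8;  // neutral default
        std::uniform_int_distribution<int> d(0, 15);
        return d(rng) < s;  // low reuse -> low allocation probability
    }
};
```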
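One possible reading of contribution 4's warp-level policy is sketched below: warps whose instructions show good locality get L1 access until the cache capacity is covered, streaming warps run but bypass the L1, and the rest are throttled. The locality threshold, the footprint estimate, and the scheduling structure are assumptions, not the dissertation's exact algorithm.

```cpp
#include <cstdint>
#include <vector>

struct WarpInfo {
    double   locality;   // e.g. estimated hit rate of its memory instructions
    uint32_t footprint;  // estimated working-set bytes of the warp
};

struct Schedule {
    std::vector<int> runnable;    // warps allowed to issue
    std::vector<int> may_use_l1;  // subset allowed to allocate in the L1
};

Schedule allocate_concurrency(const std::vector<WarpInfo>& warps,
                              uint32_t l1_capacity_bytes) {
    Schedule s;
    uint32_t used = 0;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i) {
        if (warps[i].locality > 0.5 &&
            used + warps[i].footprint <= l1_capacity_bytes) {
            // High-locality warp whose working set still fits: run with L1.
            s.may_use_l1.push_back(i);
            s.runnable.push_back(i);
            used += warps[i].footprint;
        } else if (warps[i].locality <= 0.5) {
            // Streaming warp: runs but bypasses the L1, leaving it to others.
            s.runnable.push_back(i);
        }
        // High-locality warps that no longer fit are throttled (not runnable),
        // keeping cache contention and bandwidth demand bounded.
    }
    return s;
}
```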
Keywords/Search Tags: Heterogeneous system, Many-core, Cache, Power-efficiency, Performance