
Research On Key Technologies For Cache Power And Performance Optimization On Many-core Heterogeneous Architecture

Posted on: 2015-08-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Zheng
Full Text: PDF
GTID: 1108330509461017
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneity and many-core are the trends in current processor design. Heterogeneity means that cores with different ISAs are integrated into a single chip; many-core refers to integrating a very large number of cores, up to thousands, on a single chip. Heterogeneous many-core systems have gained popularity in high-performance computing owing to their high performance and energy efficiency. However, the power wall and the memory wall are impeding further performance improvement of such processors. As manufacturing technology improves, more transistors can be placed on one chip, but some of them must be powered off because of power-consumption and heat-dissipation constraints. In addition, the speed of main memory still lags behind core speed and becomes the performance bottleneck.

Caches have long alleviated the memory-wall problem by exploiting data locality. However, caches consume a large portion of the chip area and up to 50% of the chip power. Both cache power and cache performance must therefore be optimized, to lower chip power and improve performance; such optimization is especially meaningful for heterogeneous many-core systems.

This thesis deals with cache power and the utilization of limited cache resources. We focus on two problems: (1) energy waste in the level-one cache caused by parallel way access, and (2) cache allocation when a cache is shared by thousands of threads. The main contributions are as follows (an illustrative sketch of each mechanism appears after the list):

1. We propose region-based way partitioning of the level-one cache for low power. Exploiting the distinct access patterns of the stack and non-stack regions, accesses to the two regions are isolated and directed to different ways of the cache. Each access therefore probes fewer cache ways than usual, reducing access energy. The partition can also be reconfigured dynamically to adapt to different programs, avoiding the potential performance degradation of a static partition. Experiments on a 4-way set-associative cache show that our method saves about 28% of the cache power consumption.

2. We propose tag-check elision to save cache access power. Addressing the same parallel-access problem, we access a previously recorded cache line directly, skipping the tag check and the parallel way probe. For a cache line, we record the base register used to access it and the corresponding offset bounds. When memory is next accessed through the same base register, a bounds check determines whether the address falls within the recorded cache line; if it does, the data is read directly from that line without a tag check or TLB access, saving further power. Experimental results show that this approach saves 30% and 67% of the cache and DTLB dynamic energy, respectively.

3. We propose dynamic cache allocation for many-core architectures. In many-core architectures such as GPGPUs, the disparity between cache capacity and thread count leads to low cache utilization. To improve utilization, we propose a stochastic cache-allocation scheme. For each cache line, the requesting PC and reuse information are recorded, and instructions are granted cache lines according to that reuse information: low cache-line reuse means a low probability of allocation. This decision improves cache utilization and avoids excessive memory accesses, improving performance. A speedup of up to 2.5X is achieved under this allocation mechanism.
4. We propose concurrency allocation based on cache performance. The caches and memory bandwidth of a GPGPU cannot sustain thousands of threads running concurrently. We therefore propose joint cache and concurrency allocation based on access patterns. Instructions are categorized by access pattern, and their locality is analyzed using the access pattern and the cache miss rate. The warps that may use the level-one cache, and the warps that may run at all, are then determined from instruction locality and cache capacity. As a result, memory bandwidth and computational resources are better utilized, boosting performance. Experimental results show that this approach improves performance by up to 3X, and by 54% on average.
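To make contribution 1 concrete, here is a minimal C++ sketch of a lookup that probes only the ways assigned to the access's region. The 4-way geometry, the 2/2 way split, the 64-byte line size, and all identifiers are assumptions for illustration, not the thesis's actual hardware design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical 4-way L1 model: a configurable mask assigns ways to the
// stack region and the remaining ways to everything else, so the split
// can be retuned per program as the dissertation describes.
struct CacheLine { bool valid = false; uint64_t tag = 0; };

struct WayPartitionedL1 {
    static constexpr int kWays = 4;
    static constexpr int kSets = 64;
    std::array<std::array<CacheLine, kWays>, kSets> sets{};
    uint8_t stack_way_mask    = 0b0011;  // ways reserved for stack accesses
    uint8_t nonstack_way_mask = 0b1100;  // ways reserved for non-stack accesses

    // Only the ways belonging to the access's region are probed, so a
    // lookup activates 2 way comparators instead of 4, saving energy.
    std::optional<int> lookup(uint64_t addr, bool is_stack) {
        uint64_t set = (addr >> 6) % kSets;  // 64-byte lines assumed
        uint64_t tag = addr >> 12;           // bits above set index + offset
        uint8_t mask = is_stack ? stack_way_mask : nonstack_way_mask;
        for (int w = 0; w < kWays; ++w) {
            if (!(mask & (1u << w))) continue;  // skip ways outside region
            const CacheLine& ln = sets[set][w];
            if (ln.valid && ln.tag == tag) return w;
        }
        return std::nullopt;  // miss within the region's ways
    }
};
```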
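Contribution 2 can be sketched as a small table indexed by base register, holding the bounds of the last line touched through that register. The record layout, the 64-byte line size, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <optional>

// Per-base-register record of the last cache line accessed through that
// register: the line's address bounds plus its location in the data array.
struct LineRecord {
    bool     valid = false;
    uint64_t line_lo = 0;  // first byte address covered by the line
    uint64_t line_hi = 0;  // one past the last covered byte
    int      way = -1;     // where the line lives in the data array
};

constexpr int kNumBaseRegs = 32;
LineRecord records[kNumBaseRegs];

// Returns the way to read directly, with the tag check and TLB lookup
// elided, when the new address provably lies in the recorded line;
// otherwise the caller falls back to a normal cache access.
std::optional<int> try_elide(int base_reg, int64_t offset, uint64_t base_val) {
    uint64_t addr = base_val + offset;
    const LineRecord& r = records[base_reg];
    if (r.valid && addr >= r.line_lo && addr < r.line_hi)
        return r.way;     // bounds check passed: skip tag check and TLB
    return std::nullopt;  // outside the recorded line: full access needed
}

// Refresh the record after a normal access resolves the line's location.
void record_access(int base_reg, uint64_t addr, int way) {
    uint64_t lo = addr & ~63ull;  // align down to the 64-byte line base
    records[base_reg] = { true, lo, lo + 64, way };
}
```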
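The stochastic allocation of contribution 3 can be approximated with a per-PC saturating reuse counter that drives the allocation probability. The counter width, the table layout, and the probability mapping below are guesses for illustration only.

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>

// Reuse-driven allocation: each memory-instruction PC accumulates a small
// reuse score; on a miss, a cache line is granted with probability
// proportional to that score, so rarely reused instructions mostly bypass
// the cache instead of polluting it.
struct ReuseTable {
    std::unordered_map<uint64_t, uint8_t> score;  // PC -> saturating counter
    std::mt19937 rng{42};

    void on_hit(uint64_t pc) {                 // line was reused: reward PC
        uint8_t& s = score[pc];
        if (s < 15) ++s;
    }
    void on_evict_unused(uint64_t pc) {        // evicted untouched: penalize
        uint8_t& s = score[pc];
        if (s > 0) --s;
    }

    // Decide whether a missing line requested by `pc` gets a cache line.
    bool allocate(uint64_t pc) {
        uint8_t s = score.count(pc) ? score[pc] : 8;  // neutral default
        std::uniform_int_distribution<int> d(0, 15);
        return d(rng) < s;  // low reuse -> low allocation probability
    }
};
```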
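One possible reading of contribution 4's warp-level policy is sketched below: warps whose instructions show good locality get L1 access until the cache capacity is covered, streaming warps run but bypass the L1, and the rest are throttled. The locality threshold, the footprint estimate, and the scheduling structure are assumptions, not the dissertation's exact algorithm.

```cpp
#include <cstdint>
#include <vector>

struct WarpInfo {
    double   locality;   // e.g. estimated hit rate of its memory instructions
    uint32_t footprint;  // estimated working-set bytes of the warp
};

struct Schedule {
    std::vector<int> runnable;    // warps allowed to issue
    std::vector<int> may_use_l1;  // subset allowed to allocate in the L1
};

Schedule allocate_concurrency(const std::vector<WarpInfo>& warps,
                              uint32_t l1_capacity_bytes) {
    Schedule s;
    uint32_t used = 0;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i) {
        if (warps[i].locality > 0.5 &&
            used + warps[i].footprint <= l1_capacity_bytes) {
            // High-locality warp whose working set still fits: run with L1.
            s.may_use_l1.push_back(i);
            s.runnable.push_back(i);
            used += warps[i].footprint;
        } else if (warps[i].locality <= 0.5) {
            // Streaming warp: runs but bypasses the L1, leaving it to others.
            s.runnable.push_back(i);
        }
        // High-locality warps that no longer fit are throttled (not runnable),
        // keeping cache contention and bandwidth demand bounded.
    }
    return s;
}
```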
Keywords/Search Tags: Heterogeneous system, Many-core, Cache, Power-efficiency, Performance