Research On Compiler-assisted Cache Coherence For GPGPU

Posted on:2019-12-19

Degree:Master

Type:Thesis

Country:China

Candidate:C Q Zang

Full Text:PDF

GTID:2428330545453697

Subject:Software engineering

Abstract/Summary:

GPGPU-based heterogeneous computing architectures are now widely used in fields such as cloud computing, big data, and deep learning. As GPU architectures evolve and core counts grow, data correctness is becoming an increasingly prominent problem. Multiprocessor systems commonly maintain cache coherence with a directory-based hardware protocol. On a GPU, however, the high degree of parallelism means that a hardware coherence protocol incurs higher communication traffic, larger memory overhead, and greater design complexity, so traditional protocols designed for chip multiprocessors cannot be adopted directly. Modern GPUs avoid the coherence problem by not caching global data in the private L1 cache. However, bypassing the L1 cache can slow down cache-sensitive GPU applications and can cause excessive off-chip main memory accesses, which degrades the overall performance of heterogeneous computing platforms. Many GPU applications run faster when they use the private L1 cache, but blindly loading data into the L1 cache leads to incoherent caches. This work builds on the observation that many GPU kernels have predictable memory access patterns. We propose a static program analysis that lets GPU kernels conservatively load into the private L1 cache only the global data that is guaranteed to be free of coherence issues. We have integrated the proposed framework into the Nvidia NVCC compiler and use the cache operators in the PTX ISA to automatically generate sound, high-performance executables without any hardware-level coherence protocol. Experimental results on off-the-shelf GPU platforms show average speedups of 1.38x, 1.26x, and 1.24x for cache-sensitive applications on the Jetson TX1, Jetson TX2, and GTX1060 platforms, respectively. Total L2 cache transactions are reduced by 31%, 31%, and 48% on the same platforms, respectively.
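
The idea of picking a PTX cache operator per load can be illustrated with a toy sketch. It is not the analysis from the thesis, whose details the abstract does not give. It assumes a deliberately simple safety criterion: a global load may use the private L1 cache only if the kernel never stores to the same buffer. The `MemOp` record, the `choose_cache_operators` function, and the sample kernel are all hypothetical, and buffer names are assumed not to alias. The two operators themselves are standard PTX: `ld.global.ca` caches at all levels (L1 and L2), while `ld.global.cg` caches at the global level only (L2, bypassing L1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """Hypothetical record for one global-memory access in a kernel."""
    kind: str    # "load" or "store"
    buffer: str  # name of the global buffer accessed (assumed not to alias)


def choose_cache_operators(ops):
    """Return (buffer, PTX load operator) for each load, in program order.

    Assumed criterion (not taken from the thesis): a buffer that the kernel
    never writes cannot become stale in a private L1 cache during the kernel,
    so its loads may use .ca; every other load conservatively uses .cg.
    """
    written = {op.buffer for op in ops if op.kind == "store"}
    plan = []
    for op in ops:
        if op.kind != "load":
            continue
        operator = "ld.global.cg" if op.buffer in written else "ld.global.ca"
        plan.append((op.buffer, operator))
    return plan


# Hypothetical kernel body: out[i] = a[i] + b[i]; out[i] = out[i] + out[j]
kernel = [
    MemOp("load", "a"),
    MemOp("load", "b"),
    MemOp("store", "out"),
    MemOp("load", "out"),
    MemOp("store", "out"),
]

for buffer, operator in choose_cache_operators(kernel):
    print(f"{buffer}: {operator}")
```

In this sketch the read-only inputs `a` and `b` are loaded through L1, while the load from `out` bypasses it. An analysis that reasons about access patterns, as the abstract describes, could presumably admit more loads into L1 than this whole-buffer rule does.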

Keywords/Search Tags:

GPU, cache coherence, Static Program Analysis

Related items

1	Research And Design Of Cache Coherence For Multi-core Processors
2	Analysis And Implementation Of Cache Coherence Protocols For CMP
3	Application Research Of Data Cache Technology In MIS
4	Research On Program Static Analysis
5	Research On Cache Coherence Of Multi-core Microprocessor
6	Assessment of cache coherence protocols in shared-memory multiprocessors
7	Research And Development Of Cache Coherence In Symmetric Multi-Processor
8	Study And Implementation On Scalable Hierarchical Cache Coherence Directory Scheme
9	Improving Program Cache Performance By Program Analysis And Optimization
10	Design And Implementation Of A Static Bug Detection Tool For PLC Program