Research On Compiler-assisted Cache Coherence For GPGPU

Posted on: 2019-12-19 | Degree: Master | Type: Thesis
Country: China | Candidate: C Q Zang
GTID: 2428330545453697 | Subject: Software engineering
Abstract/Summary:
With the widespread adoption of GPGPU-based heterogeneous computing architectures in fields such as cloud computing, big data, and deep learning, and with GPU architectures integrating more and more cores, the problem of data correctness is becoming increasingly prominent. Multiprocessor systems commonly maintain cache coherence with a directory-based hardware cache coherence protocol. However, because of the high parallelism of GPU architectures, a hardware cache coherence protocol would result in higher communication traffic, larger memory overhead, and higher design complexity. Traditional cache coherence protocols designed for chip multiprocessors therefore cannot be directly adopted in GPUs. Modern GPUs avoid the cache coherence issue by not caching global data in the private L1 cache. However, bypassing the L1 cache may slow down cache-sensitive GPU applications and can lead to excessive off-chip main memory accesses, which affects the overall system performance of heterogeneous computing platforms. Many GPU applications perform better when they use the private L1 cache, but blindly loading data into the L1 cache causes cache incoherence. In this work, based on the observation that many GPU kernels have predictable memory access patterns, we propose a static program analysis that enables GPU kernels to conservatively load into the private L1 cache only global data that is guaranteed to have no coherence issues. We have integrated the proposed framework into the Nvidia NVCC compiler and use the cache operators in the PTX ISA to automatically generate sound, high-performance executables without any hardware-level coherence protocol. Experimental results on off-the-shelf GPU platforms show average speedups of 1.38x, 1.26x, and 1.24x for cache-sensitive applications on the Jetson TX1, Jetson TX2, and GTX1060 platforms, respectively. Meanwhile, total L2 cache transactions are reduced by 31%, 31%, and 48% on these platforms, respectively.
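
The idea of conservatively choosing a cache operator per load can be illustrated with a small sketch. This is not the thesis's analysis, which the abstract does not describe in detail. It assumes a much simpler rule: a global array that is never stored to within a kernel is treated as safe to cache in L1, and every other array bypasses L1. The `Access` record, the `choose_cache_operators` function, and the toy kernel are hypothetical names introduced for illustration. The only real identifiers are the PTX cache operators `.ca` (cache at all levels, including L1) and `.cg` (cache at the global level, L2 only).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One global-memory access in a toy kernel (hypothetical representation)."""
    kind: str   # "load" or "store"
    array: str  # name of the global array being accessed


def choose_cache_operators(accesses):
    """Pick a PTX load operator for every load, in program order.

    Assumed rule (a simplification, not the thesis's method): arrays that
    are never written in the kernel use ld.global.ca (L1 + L2); arrays
    that are written anywhere in the kernel use ld.global.cg (L2 only).
    """
    written = {a.array for a in accesses if a.kind == "store"}
    plan = []
    for a in accesses:
        if a.kind != "load":
            continue  # stores are not assigned a load operator
        op = "ld.global.cg" if a.array in written else "ld.global.ca"
        plan.append((a.array, op))
    return plan


# Toy kernel: A is read-only, B is both read and written.
kernel = [
    Access("load", "A"),
    Access("load", "B"),
    Access("store", "B"),
    Access("load", "B"),
]
plan = choose_cache_operators(kernel)
for array, op in plan:
    print(f"{array}: {op}")
```

The rule is deliberately conservative: a single store to an array sends all of its loads through L2, even loads that could not observe stale data. A more precise analysis of access patterns could cache more loads in L1.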
Keywords/Search Tags: GPU, cache coherence, static program analysis