
Data Sharing Optimization On CPU-GPGPU Shared Last Level Cache System

Posted on: 2019-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L C Yu
Full Text: PDF
GTID: 1368330548477375
Subject: Computer Science and Technology
Abstract/Summary:
As CPUs and GPGPUs are used in many different environments, their unique features and strengths are exploited. To take advantage of both for broader adoption, heterogeneous systems that include both a CPU and a GPGPU have come into focus. Among them, a fused CPU-GPGPU with a shared last level cache (LLC) enables fine-grained interaction. However, owing to inefficient data sharing caused by the two processors' different memory access patterns, simply connecting the CPU and GPGPU through the shared LLC cannot fully exploit the computing power of the heterogeneous system.

This work first characterizes the memory access patterns observed when the CPU and GPGPU exchange data via the shared LLC. The results reveal that a simple cache management policy can hurt data sharing efficiency and lead to cache thrashing, with intermediate data unnecessarily spilled to memory, wasting memory bandwidth. Meanwhile, the traditional data sharing paradigm requires one processor to wait until the other delivers its output, a latency that cannot easily be hidden.

This work introduces the shared LLC buffer, which exchanges data in fixed-size elements and is managed by hardware inside the LLC. Element-atom data organization is further proposed to remove the limitation on element size and to enable out-of-order access from the GPGPU. Backing memory is added to the LLC buffer to prevent deadlock, and low-overhead global synchronization is also supported by the LLC buffer.

With data shared in the LLC, algorithms on the CPU and the GPGPU usually prefer different optimal data layouts because of their different data localities, which results in data layout conflicts. Current layout conversion methods potentially pollute the private caches of the converting processor, or incur the overhead of either executing conversion code or managing layouts. This work provides architecture-supported data layout remapping in the LLC. It allows the algorithms on both processors to access their data in their optimal layouts and to exploit their private caches. A programmable remapping controller performs the conversion more flexibly than extra conversion code running on either the CPU or the GPGPU.

GPGPU code contains many affine computations, in which threads execute the same code with their thread IDs as input. These are redundant computations that cost system energy and produce massive memory access requests that are not scheduler-friendly. This work presents decoupled memory access for the GPGPU: the CPU provides the affine computation arguments to the LLC, which generates the actual memory requests and issues them to memory on behalf of all the GPGPU threads. The returned data are then coalesced and pushed back to the threads. As a result, decoupled memory access eliminates most of the address calculation and memory access code, and improves overall performance.

This work evaluates the above methods with a simulator. The results show that the shared LLC buffer achieves a 48% speedup over the traditional data sharing method, and that the global synchronization times of the CPU and GPGPU are further reduced to 21% and 38%, respectively. With LLC layout remapping, the run time across all benchmarks is reduced to 69% on average; compared with layout conversion on the CPU and on the GPGPU, LLC layout remapping reduces the conversion time to 58% and 46%, respectively. Decoupled memory access reduces the average run time to 48% and the number of instructions executed by the GPGPU to 84%. In conclusion, the optimization methods proposed in this work improve data sharing between the CPU and GPGPU over the shared LLC in several aspects.
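To make the traditional sharing paradigm concrete, the following minimal CUDA sketch (an illustrative example written for this summary, not code from the dissertation) shows the coarse-grained producer-consumer hand-off it describes: the CPU must finish producing the whole buffer before the GPGPU kernel can consume it, so the hand-off latency is fully exposed. A hardware-managed LLC buffer would instead stream fixed-size elements between the two processors.

```cuda
// Baseline coarse-grained CPU -> GPGPU hand-off (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void consume(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;      // consumer stage on the GPGPU
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < n; ++i) in[i] = (float)i;  // producer stage on the CPU

    // The GPGPU may start only after the entire buffer is ready; with an
    // LLC buffer, elements could be consumed as they are produced.
    consume<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```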
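The data layout conflict can be pictured with the common AoS-versus-SoA case: a CPU traversal of whole records favors an array of structures, while coalesced GPGPU accesses favor a structure of arrays. The kernel below (again an illustrative sketch, not the dissertation's code) performs the conversion in software; this is the kind of work, along with the cache pollution it causes, that the programmable remapping controller moves into the LLC.

```cuda
// Software AoS -> SoA layout conversion (illustrative sketch).
#include <cuda_runtime.h>

struct Particle { float x, y, z; };        // AoS: CPU-friendly locality

__global__ void aos_to_soa(const Particle *aos,
                           float *x, float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                           // scatter each field into its own
        x[i] = aos[i].x;                   // contiguous array (SoA), so that
        y[i] = aos[i].y;                   // GPGPU warps load it coalesced
        z[i] = aos[i].z;
    }
}
```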
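Finally, the affine computations targeted by decoupled memory access look like the sketch below (illustrative; the (base, stride, count) tuple is an assumed interface, not the dissertation's): every thread evaluates the same base-plus-stride expression on its thread ID. Decoupled memory access would instead hand such arguments to the LLC, which generates and issues the memory requests on the threads' behalf.

```cuda
// Per-thread affine address computation (illustrative sketch). Every
// thread redundantly evaluates base + stride * tid; an LLC-side
// controller could expand (base, stride, count) into these requests.
__global__ void affine_load(const float *base, float *out,
                            int stride, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // thread ID as input
    if (tid < n)
        out[tid] = base[tid * stride];     // affine address per thread
}
```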
Keywords/Search Tags: heterogeneous multi-processor, shared last level cache, memory access optimization, GPGPU