
The Optimized Design Of GPGPU On-chip Memory System

Posted on: 2018-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: F F Fan
Full Text: PDF
GTID: 2428330590977644
Subject: Computer Science and Technology
Abstract/Summary:
Nowadays, the GPGPU has become a popular platform for general-purpose computing, delivering powerful computing ability by leveraging thousands of threads. However, massive thread counts also put heavy pressure on some GPGPU on-chip memory resources. In particular, many threads often compete for the small first-level data (L1D) cache, which leads to severe cache thrashing and further hurts GPGPU performance. At the same time, other GPGPU on-chip memory resources suffer serious under-utilization; for example, many registers and much shared memory sit unoccupied at runtime, which wastes resources. In this thesis, we explore the optimized design of the GPGPU on-chip memory system to solve these problems.

First, we introduce a victim cache to the GPGPU to keep more data in on-chip memory and thereby alleviate L1D cache thrashing. In CPUs, the victim cache is usually a small fully associative cache; to suit GPGPU applications with massive concurrent threads, we instead design a set-associative victim cache whose organization matches that of the L1D cache. Second, we apply a simple prediction scheme that keeps the most frequently used data in the L1D cache. This avoids costly cache-line evictions and interchanges between the victim cache and the L1D cache, and thus enables better cooperation between the two. Third, we propose using the unoccupied registers and shared memory to hold the data of the cache lines in the victim cache, which further saves chip area and hardware cost.

The experimental results show that introducing the victim cache improves the hit ratio of the on-chip data cache by 26.8% and the system performance by 36.3% on average. Meanwhile, our prediction scheme reduces cache-line evictions and interchanges between the L1D cache and the victim cache by 21.8% and further improves system performance by 4.9%. Moreover, after using the unoccupied registers and shared memory as the storage units for the data blocks of the victim cache, the utilization of GPGPU on-chip memory is largely improved and the hardware cost of the victim cache is significantly reduced.
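The interplay described above between the L1D cache, the set-associative victim cache, and the prediction scheme can be illustrated with a small behavioral simulation. This is only a minimal sketch under illustrative assumptions, not the thesis implementation: the cache geometry, the per-line reuse counter used as the predictor, and the `promote_threshold` parameter are all invented here for demonstration.

```python
# Behavioral sketch (illustrative only): a set-associative L1D cache backed by
# a victim cache with the same geometry. A simple reuse-count predictor skips
# the costly L1D<->victim interchange for lines that look cold.
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        # one LRU-ordered tag store per set
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _index(self, line_addr):
        return line_addr % self.num_sets

    def lookup(self, line_addr):
        s = self.sets[self._index(line_addr)]
        if line_addr in s:
            s.move_to_end(line_addr)  # refresh LRU position
            return True
        return False

    def insert(self, line_addr):
        """Insert a line; return the evicted line address, or None."""
        s = self.sets[self._index(line_addr)]
        victim = None
        if len(s) >= self.ways:
            victim, _ = s.popitem(last=False)  # evict the LRU line
        s[line_addr] = None
        return victim

    def remove(self, line_addr):
        self.sets[self._index(line_addr)].pop(line_addr, None)

class L1WithVictim:
    def __init__(self, num_sets=4, ways=4, promote_threshold=2):
        self.l1 = SetAssocCache(num_sets, ways)
        self.vc = SetAssocCache(num_sets, ways)  # same geometry as the L1D
        self.reuse = {}          # per-line access counter (the "predictor")
        self.threshold = promote_threshold
        self.interchanges = 0

    def access(self, line_addr):
        """Return 'l1', 'victim', or 'miss' for one memory access."""
        self.reuse[line_addr] = self.reuse.get(line_addr, 0) + 1
        if self.l1.lookup(line_addr):
            return "l1"
        if self.vc.lookup(line_addr):
            # Predictor: only pay for an interchange if the line looks hot;
            # cold lines are served from the victim cache in place.
            if self.reuse[line_addr] >= self.threshold:
                self.vc.remove(line_addr)
                evicted = self.l1.insert(line_addr)
                if evicted is not None:
                    self.vc.insert(evicted)
                self.interchanges += 1
            return "victim"
        # Full miss: fill into L1D; the L1D victim spills into the victim cache
        # instead of being dropped, keeping more data on chip.
        evicted = self.l1.insert(line_addr)
        if evicted is not None:
            self.vc.insert(evicted)
        return "miss"
```

For example, with a 1-set, 2-way configuration, the access stream `0, 1, 2, 0` misses three times, but the fourth access to line 0 hits in the victim cache rather than going off chip, which is the thrashing case the design targets.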
Keywords/Search Tags:GPGPU, register, L1D cache, shared memory, victim cache, prediction scheme