
An Area And Bandwidth Efficient Programmable Shader Architecture For Embedded Graphics Processing Units

Posted on: 2014-04-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Chang
Full Text: PDF
GTID: 1268330422968191
Subject: Computer application technology
Abstract/Summary:
With the development of silicon technology and growing application requirements, embedded graphics processing units (GPUs) with multiple unified shaders have been widely integrated into Systems-on-Chip (SoCs) for high-end mobile devices. However, the number of programmable shader cores in an embedded GPU architecture is restricted by silicon area cost, so shader architecture design must improve performance while maintaining area efficiency. Moreover, a large amount of graphics data located in external memory must be accessed during rendering, leading to higher bus bandwidth and even to huge power dissipation in embedded GPUs. Therefore, it is essential to optimize area cost and data bandwidth in programmable shader architecture. This dissertation presents research addressing both problems, including a modeling method for multi-core embedded GPU architecture, an area-efficient arithmetic datapath and processor architecture for shaders, and a bandwidth-optimized vertex cache hierarchy for multi-shader architecture. The main goal of the proposed work is to provide fundamental theory and technology for future research and design of multi-core embedded GPU architecture.

First, a high-level, full-system simulation platform based on hybrid modeling methods for embedded GPUs is proposed. To avoid the slow simulation speed of complex system software, an instruction-set simulator based on QEMU is proposed. Additionally, the interconnection network and device interfaces in the SoC are modeled in SystemC-TLM to improve simulation efficiency. After that, we introduce a basic embedded GPU architecture based on multiple unified shaders and internal data buffers. To describe its microarchitecture, a detailed cycle-level model is proposed and combined with the SystemC-TLM hardware model to provide a fundamental experimental platform for our research.

Second, area-efficient floating-point (FP) function units for the shader are proposed.
To begin with, a unified, multi-functional FP vector arithmetic unit (VAU) is implemented. To support basic vector operations, the main hardware blocks in the conventional vector production unit are vectorized and multiplexed, which maintains performance while avoiding a large additional area cost. Based on the VAU, we introduce a method that uses idle VAUs in the shader architecture to calculate quadratic approximations, which further reduces the area cost of the elementary transcendental function unit.

Third, a high-performance, area-efficient programmable shader architecture based on transport triggered architecture (TTA) is proposed. With the help of fine-grained data transport and visible bypass at the microarchitecture level, redundant write-back of instruction results can be avoided, which benefits the exploitation of instruction-level parallelism. Then a detailed TTA-like vertex shader microarchitecture is implemented. Combining features of both TTA and vertex processing, we define a customized shading instruction set. By configuring the number of functional units and optimizing the design of the register ports and the result write-back scheme, the area cost of the implemented vertex shader can be further reduced. We finally implement the proposed vertex shader in both an ASIC design and an FPGA prototype platform to show that the proposed TTA-like shader architecture can provide high performance with reduced area cost, leading to significant area efficiency for embedded platforms.

Finally, we introduce a primitive-oriented vertex fetch (POVF) scheme to eliminate sequential dependencies among different vertex batches in the multiple-shader architecture. Based on it, we reduce vertex data fetching bandwidth by optimizing the vertex cache hierarchy for multi-shader architecture. To reduce bus access frequency for vertex data, a pre-TnL vertex cache combined with the POVF scheme is proposed to hold recently fetched vertex data before shading.
In addition, a tag-SRAM separated post-TnL vertex cache is implemented to buffer recently shaded vertex results at different stages of vertex processing. To guarantee valid vertex cache results, hardware logic for in-order submission of vertex batches is also implemented in the task scheduler of the multi-shader embedded GPU architecture. Simulation results show that the amount of redundant vertex data processing and the vertex bandwidth can be reduced using the separated post-TnL vertex cache.

Simulation and implementation results show that area cost and vertex fetching bandwidth can be effectively optimized using the microarchitecture design methods proposed in this dissertation, which is a beneficial exploration for future research and design of embedded GPU architecture based on multiple unified shaders.
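
As a rough illustration of why a post-TnL vertex cache reduces redundant vertex processing, the sketch below counts vertex shader invocations for an indexed triangle stream. It is not the dissertation's tag-SRAM separated design: the FIFO replacement policy, the function name `count_shaded_vertices`, and the sample index stream are all assumptions made for this example.

```python
from collections import deque


def count_shaded_vertices(indices, cache_size):
    """Count vertex-shader invocations for an index stream.

    Assumes a FIFO post-TnL cache keyed by vertex index (an illustrative
    policy, not the one used in the dissertation).
    """
    cache = deque()  # indices of recently shaded vertices, oldest first
    shaded = 0
    for idx in indices:
        if idx in cache:
            continue  # hit: the shaded result is reused, no shader work
        shaded += 1  # miss: the vertex must be shaded
        if cache_size > 0:
            if len(cache) == cache_size:
                cache.popleft()  # evict the oldest entry
            cache.append(idx)
    return shaded


# Two triangles sharing an edge: 6 indices, 4 unique vertices.
quad = [0, 1, 2, 2, 1, 3]
print(count_shaded_vertices(quad, cache_size=0))  # 6
print(count_shaded_vertices(quad, cache_size=4))  # 4
```

Without a cache every index triggers a shader invocation; with a cache large enough to hold the shared vertices, each unique vertex is shaded once.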
Keywords/Search Tags: Embedded Graphics Processing Unit (GPU), Programmable Shader, System-level Simulation Platform, Transport Triggered Architecture, Vertex Cache