Key Techniques Research On Terascale Embedded Computing

Posted on: 2013-05-03 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Q M Yang | Full Text: PDF
GTID: 1268330422974289 | Subject: Computer Science and Technology
Abstract/Summary:
With the evolution of more sophisticated communication standards and algorithms, embedded applications demand ever higher performance and efficiency. Some emerging applications require tera-scale operations per second. Although the rapid development of VLSI technology makes it possible to build processors with tera-scale computing capacity, turning billions of transistors into actual computing power remains a challenging task. With a simple control structure, a traditional embedded processor achieves very low power consumption but does not provide enough performance. High-performance microprocessors such as GPUs and MIC integrate billions of transistors using many-core technology and can deliver more than 1 Tops, but they fall far short of the power and energy-efficiency needs of future embedded applications, because they rely on multithreading and shared coherent caches, which consume much energy. To address these problems, this thesis takes “Key Techniques Research on Terascale Embedded Computing” as its subject.

This thesis focuses on energy-efficient architecture techniques, including a new data memory hierarchy design, the interconnection of functional units in a fully distributed VLIW, ultra-low-power processor core design, and the organization of computing resources. Its main contributions and innovations are as follows:

1. We propose a multi-level granularity-matched register hierarchy named MGR. MGR divides the data accesses of embedded applications into three layers.
The outermost layer deals with sequential, predictable streaming data; the middle layer deals with block data, where the dependencies between blocks are weak; the innermost layer deals with data within the same block, where the access pattern is flexible and random. Corresponding to the three layers, MGR uses a frame buffer register file, an enhanced register file, and a tiny-sized pixel register file to capture their respective data localities. Each memory layer is thus concerned only with its own function, and its hardware implementation becomes simple and efficient. Compared with other typical memory hierarchies, the results show that MGR reduces energy consumption by 53%-62% while achieving almost the same performance.

2. We study the partially connected crossbar for fully distributed VLIW. A crossbar with full connectivity has high delay, high power consumption, and poor scalability. We first analyze how the full crossbar is used in embedded applications and summarize several typical communication patterns. Corresponding to these patterns, several kinds of crossbars with sparse connectivity are proposed. We model the delay, area, and power of the partially connected crossbar. The experimental results show that, compared with the full crossbar, the partially connected crossbar greatly reduces hardware cost while decreasing performance only slightly. Moreover, as the number of functional units in the VLIW scales up, the partially connected crossbar becomes more efficient.

3. We design an ultra-low-power embedded processor core. Future many-core processors may consist of a large number of small processor cores and some big processor cores. To fill the role of the small core, an ultra-low-power embedded processor core named Smart Core is proposed.
Following the principles of explicit parallelism and accurate computing, Smart Core uses the VLIW execution mode, a multi-level data memory hierarchy (streaming memory + hierarchical register file + tiny-sized register file), and an asymmetrical fully distributed instruction register to reduce the energy of the instruction pipeline, data supply, and instruction supply, respectively. Preliminary results show that Smart Core achieves an energy efficiency 25x greater than that of a traditional embedded RISC processor. When scaled to a 40 nm CMOS technology, a single-chip multiprocessor consisting of many cores like Smart Core is capable of providing more than 1 Tops while achieving an efficiency of 100 Gops/W or more.

4. We present a multi-granularity reconfigurable DSP based on a stream architecture template, named MGR-SAT. MGR-SAT merges stream processing, dynamic reconfiguration, and platform-based design technologies, and consists of a scalar core, a stream processing core, and external interfaces. The stream processing core consists of a coarse-grained reconfigurable unit and a fine-grained reconfigurable unit and can be reconfigured dynamically at run time. The scalar core is responsible for configuring the stream processing core, starting it, and enabling block data transfers. The experimental results show that, compared with other typical processing platforms, MGR-SAT delivers significantly higher performance and power efficiency.
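
As a rough check on the Smart Core figures quoted above, the stated targets of 1 Tops at 100 Gops/W imply a chip power budget of about 10 W, or roughly 10 pJ per operation. The short Python sketch below only restates that arithmetic; the function names are illustrative and are not taken from the thesis.

```python
TERA = 1e12  # operations per second in 1 Tops
GIGA = 1e9   # operations per second in 1 Gops


def power_budget_watts(throughput_ops_per_s: float, efficiency_ops_per_s_per_w: float) -> float:
    """Power needed to sustain a throughput at a given energy efficiency."""
    return throughput_ops_per_s / efficiency_ops_per_s_per_w


def energy_per_op_picojoules(efficiency_ops_per_s_per_w: float) -> float:
    """Energy per operation; ops/s per watt equals ops per joule."""
    joules_per_op = 1.0 / efficiency_ops_per_s_per_w
    return joules_per_op * 1e12  # convert joules to picojoules


if __name__ == "__main__":
    # Figures from the abstract: more than 1 Tops at 100 Gops/W or more.
    watts = power_budget_watts(1 * TERA, 100 * GIGA)
    picojoules = energy_per_op_picojoules(100 * GIGA)
    print(f"power budget: {watts:.1f} W, energy per op: {picojoules:.1f} pJ")
```

Because the abstract states both figures as lower bounds, the result is an order-of-magnitude budget and not a measured value.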
Keywords/Search Tags: Terascale, Data Memory Hierarchy, Partial-connected Crossbar, Stream Architecture Template, VLIW, Dynamic Partial Reconfiguration