
Research On Key Technologies Of Scalable 64-Core Processor-Single Core, Accelerator-Rich Architecture And Realization Of H.264 Decoder

Posted on: 2015-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yu
Full Text: PDF
GTID: 2308330464960968
Subject: Microelectronics and Solid State Electronics
Abstract/Summary:
Rapidly emerging applications such as electronic communications, multimedia, information security, and cloud computing bring great convenience to everyday life. They also impose heavy computational loads that burden terminal devices, especially mobile and embedded hardware. At the same time, the pursuit of higher performance has run into the "power wall". In recent years, multicore processors have emerged as a promising way to keep pace with Moore's law. However, traditional multicore processors still show poor performance and low energy efficiency on complex, domain-specific applications.

Targeting the applications mentioned above, we analyzed the characteristics of each and struck a compromise between energy efficiency and programmability. Based on these applications, we designed a 64-core processor with rich heterogeneous accelerators, an efficient single core, and a low-power register file and instruction memory, in order to achieve high energy efficiency.

The contributions of this dissertation are summarized as follows:

(1) Combined LAN (local bidirectional token ring) and WAN (global packet-switched interconnect) on-chip interconnection
Inspired by the LAN and WAN concepts from computer networking, we designed a hybrid interconnect that combines a global 2D-mesh packet-switched network with a local bidirectional multi-token ring network. Because applications generally exhibit light global communication and heavy local communication, we use packet switching globally to improve bandwidth utilization and whole-chip resource sharing, and a low-cost local ring interconnect to reduce path setup/release overhead. The ring interconnect also supports one-cycle end-to-end communication, which improves efficiency.

(2) Heterogeneous accelerator-rich architecture design
For each application, specific accelerators are extracted to improve its performance. Experimental results show that performance can be improved by as much as 10x. This work also uses the ring interconnect described above to connect accelerators and processors, achieving efficient FIFO communication between them.

(3) Low-power register file design
In the power breakdown of an embedded processor, the register file accounts for around 16% of the entire processor's power. Observing that some read and write operations to the register file are unnecessary, we designed an asynchronous-clock-controlled read-isolation mechanism and a software-directed write-discarding mechanism, which together reduce register file power consumption by 37%.

(4) Single instruction multiple process (SIMP) architecture design
In embedded applications, neighboring parallel modules often run nearly identical instruction code. When such modules are mapped onto a conventional multicore platform, the same instructions are fetched repeatedly, which indicates substantial redundancy in instruction supply. We propose SIMP, which reconfigures these cores into a master-slave mode: only the master core fetches instructions and distributes them to the slave core(s), while the instruction memories of the slave cores are shut down to save power. Experimental results show a system power reduction of up to 21.9%.

(5) Design of an H.264 baseline decoder
Based on the characteristics of the H.264 decoder, we extracted the computing kernels and designed hardware accelerators together with the accompanying software. Using four single cores and four accelerators, the intra baseline H.264 decoder achieves a throughput of 1080p@20fps. With 16 cores and 16 accelerators, the throughput is expected to reach 1080p@80fps.

(6) Physical design
We used the TSMC 65nm GP process for the physical design of the chip, adopting a hierarchical flow with the DC-Topographical + ICC method. Useful clock skew is applied extensively to improve timing. Signoff timing analysis shows a critical path of 0.99ns (including 0.1ns of uncertainty), meeting the design target of a 1GHz maximum frequency. When executing a DES program, a single node consisting of one processor and one accelerator consumes 21.4mW.
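
The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the thesis: all function names are hypothetical, the frame-rate projection assumes perfectly linear scaling with node count (which is consistent with the 20fps and 80fps figures but is not stated as the author's method), and the whole-processor saving assumes the 16% register file share and the 37% reduction refer to the same processor and workload.

```python
# Back-of-the-envelope checks on figures quoted in the abstract.
# All helper names are hypothetical; none of this code comes from the thesis.


def max_frequency_ghz(critical_path_ns: float) -> float:
    """Highest clock frequency (GHz) permitted by a critical path given in ns."""
    return 1.0 / critical_path_ns


def linear_scaled_fps(base_fps: float, base_nodes: int, nodes: int) -> float:
    """Frame rate under the ASSUMPTION that throughput scales linearly with nodes."""
    return base_fps * nodes / base_nodes


def processor_power_saving(rf_share: float, rf_reduction: float) -> float:
    """Fraction of total processor power saved when only the register file shrinks.

    ASSUMPTION: rf_share and rf_reduction describe the same processor and workload.
    """
    return rf_share * rf_reduction


def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of 16x16 macroblocks per frame, rounding each dimension up."""
    mbs_wide = (width + mb_size - 1) // mb_size
    mbs_high = (height + mb_size - 1) // mb_size
    return mbs_wide * mbs_high


if __name__ == "__main__":
    # 0.99 ns critical path versus the 1 GHz target
    print(f"max frequency: {max_frequency_ghz(0.99):.3f} GHz")
    # 4 nodes at 20 fps projected to 16 nodes
    print(f"projected fps: {linear_scaled_fps(20, 4, 16):.0f}")
    # 16% register file share with a 37% reduction
    print(f"processor-level saving: {processor_power_saving(0.16, 0.37):.2%}")
    # 1080p frame size in macroblocks
    print(f"macroblocks per 1080p frame: {macroblocks_per_frame(1920, 1080)}")
```

Under these assumptions, the 0.99ns critical path leaves a small margin above 1GHz, and a 37% cut in a component that draws about 16% of processor power corresponds to roughly 6% of processor power overall.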
Keywords/Search Tags: Multicore Processor, Heterogeneous Accelerators-Rich, Low-power Design, Single Instruction Multiple Process, H.264 decoder