On-chip Large-scale Parallel Embedded Computing: Hierarchical Architecture Performance Model And Parallel Accelerating For H.264

Posted on: 2011-02-01 Degree: Doctor Type: Dissertation
Country: China Candidate: S G Chen
GTID: 1118330332486931 Subject: Electronic Science and Technology
Abstract/Summary:
High performance embedded computing (HPEC) is ubiquitous in daily life, industry and the military, and it has a significant impact on the development of modern society. Because of its high computational complexity and high parallelism, HPEC has been evolving from the traditional single-thread computing paradigm to the on-chip large-scale parallel embedded computing (OLPEC) paradigm. However, many aspects of OLPEC, including microprocessor architecture and application software, remain challenging. H.264/AVC high-definition real-time video compression is popular in professional applications and consumer electronics, and it is a representative OLPEC application with high computational complexity and high parallelism. Research on H.264/AVC can therefore not only meet the challenges of high-definition real-time video compression with OLPEC, but also provide practicable solutions for common OLPEC problems.

Based on a detailed analysis of the data dependencies, computational complexity and diverse parallelism in H.264/AVC, this dissertation focuses on the following research points: a performance model of the hierarchical architecture, acceleration of serial entropy coding, parallelization of entropy encoding, a hierarchical architecture platform, and a hierarchical parallel H.264/AVC video encoder prototype. The key contributions are summarized as follows.

1). An extension of Amdahl's performance model for hierarchical on-chip large-scale parallel architectures is proposed. Using a cost model of non-uniform data communication and memory access in such architectures, the dissertation extends Amdahl's Law to investigate the impact of supernodes, each of which consists of several tightly coupled cores, on system performance. Simulation results reveal that, to maintain a good system speedup, designers of hierarchical architectures should carefully balance the size of the supernode against the number of supernodes. For a given number of processing cores, the configuration that gives the hierarchical architecture optimal performance is an intermediate number of medium-sized supernodes. The optimal supernode configuration varies with the total number of cores in the hierarchical architecture.

2). A fully pipelined CABAC hardware accelerator driven by a syntax element instruction stream is proposed. Previous research barely considered the cooperation of the host CPU with the CABAC accelerator, which is an important class of problem in OLPEC. The proposed CABAC architecture employs a formatted syntax element instruction interface and allows efficient cooperation between the accelerator and the H.264/AVC encoder software. A finely tuned pipeline architecture provides a fast processing speed of one symbol per cycle. Synthesis results with 0.13 µm standard cell technology show that the proposed CABAC accelerator can achieve a throughput of 590 Mbps with 3.21K logic gates.

3). To further boost the throughput of CABAC, this dissertation proposes P3-CABAC, a tri-thread parallel arithmetic entropy coder based on partitioning of the syntax elements. Unlike previous parallel CABAC encoders, which focus almost exclusively on fine-grained bit-level parallel algorithms and architectures, P3-CABAC statically partitions the syntax elements into three groups, which are processed by three thread resources in parallel. To the best of our knowledge, this is the first thread-level parallel arithmetic coder to be proposed. Because each of the three threads employs the CABAC procedure, other fast CABAC algorithms can be applied directly to each thread in the P3-CABAC coder to increase the throughput further. Simulation results show that the proposed P3-CABAC coder can achieve a top speedup of 2.68 at the cost of a bit-rate increase of less than 3% for high-definition sequences. Compared with the CABAC accelerator proposed in contribution 2), the P3-CABAC accelerator costs only about 60% extra hardware area.

4). A hierarchical 64-core multi-DSP architecture prototype based on locally-centralized-shared-memory supernodes is proposed. Based on the results of the Amdahl performance model for hierarchical architectures, this dissertation couples every 4 reduced DSP cores through centralized shared memory into a supernode, and connects the 16 supernodes with an on-chip interconnection to establish the proposed hierarchical 64-core multi-DSP architecture. The performance of the prototype is evaluated by mapping some basic algorithms onto Verilog behavioral models. Experimental results show that the hierarchical architecture can achieve a top speedup of 1.55 compared with a flat architecture, even without any special locality mapping techniques.

5). Based on the hierarchical 64-core multi-DSP architecture, a software/hardware prototype with the P3-CABAC accelerator and a hybrid macroblock-level/subtask-level parallel H.264/AVC main profile encoder is proposed. In the prototype hardware, one of the supernodes in the hierarchical 64-core multi-DSP architecture is replaced by a P3-CABAC accelerator to boost the entropy encoding. The parallel H.264/AVC algorithm employs a hybrid parallel mechanism: the macroblock-level parallel algorithm is carried out across supernodes, and the subtasks that encode one macroblock within a supernode are executed in parallel on its 4 DSP cores. In addition, a CABAC bit-rate estimation technique previously proposed by our research group is implemented to remove the limitation that complex rate-distortion optimization imposes on macroblock-level parallel main profile encoders. Simulation on a cycle-accurate supernode simulator is carried out, and the analysis results show that, with the help of the rate estimation technique and the P3-CABAC accelerator, the proposed prototype encoder can achieve an average speedup of about 50 for high-definition sequences.

The above contributions provide efficient examples for meeting the challenges of high-definition real-time parallel H.264/AVC encoding and other HPEC problems in an OLPEC environment.
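
The trade-off described in the first contribution, between supernode size and the number of supernodes, can be illustrated with a toy model. The sketch below is not the dissertation's performance model, whose cost functions are not given in this abstract. It takes the classical Amdahl's Law and adds two hypothetical linear overhead terms, one growing with the number of cores sharing a supernode (`intra_cost`) and one growing with the number of supernodes (`inter_cost`). All function names, parameters and constants are assumptions chosen only so that the optimum falls at an intermediate configuration.

```python
def speedup(total_cores, supernode_size, parallel_fraction=0.95,
            intra_cost=0.002, inter_cost=0.0005):
    """Toy speedup of a hierarchical machine, normalized to one core.

    Hypothetical model: classical Amdahl terms plus two linear
    overheads. The overhead constants are arbitrary illustration
    values, not measurements from the dissertation.
    """
    if total_cores % supernode_size != 0:
        raise ValueError("supernode_size must divide total_cores")
    supernodes = total_cores // supernode_size

    serial_time = 1.0 - parallel_fraction
    parallel_time = parallel_fraction / total_cores
    # Assumed contention among cores sharing one supernode memory.
    intra_overhead = intra_cost * (supernode_size - 1)
    # Assumed communication cost across the on-chip interconnect.
    inter_overhead = inter_cost * (supernodes - 1)

    total_time = serial_time + parallel_time + intra_overhead + inter_overhead
    return 1.0 / total_time


def best_supernode_size(total_cores, **model_params):
    """Return the supernode size with the highest toy speedup."""
    sizes = [k for k in range(1, total_cores + 1) if total_cores % k == 0]
    return max(sizes, key=lambda k: speedup(total_cores, k, **model_params))


if __name__ == "__main__":
    for cores in (64, 256):
        k = best_supernode_size(cores)
        print(cores, "cores:", cores // k, "supernodes of", k, "cores")
```

Under these assumed constants, neither extreme (one large supernode or many single-core nodes) gives the best speedup, and the preferred supernode size changes when the total core count changes. This mirrors the qualitative conclusion stated in the abstract, but the specific numbers the sketch produces carry no weight as results.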
Keywords/Search Tags: on-chip large-scale parallel, high performance embedded computing, hierarchical architecture, performance model, H.264/AVC parallel encoder, accelerator