Instruction-flow Scheduling Mechanism For High-performance SIMD DSP

Posted on:2015-06-25

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Yang

Full Text:PDF

GTID:1108330479479661

Subject:Microelectronics and Solid State Electronics

Abstract/Summary:

With the development of embedded applications and advances in chip design technology, processor architecture now focuses on exploiting parallelism with more parallel computing resources rather than relying on more complex serial hardware and higher clock frequencies. Architectures that combine Very Long Instruction Word (VLIW) with a variable-length instruction set, Single Instruction stream Multiple Data streams (SIMD), and multicore technologies have become the mainstream of Digital Signal Processor (DSP) design. Although these techniques can fully exploit the parallelism inside applications and greatly improve processor performance with low hardware overhead, they are increasingly limited by data path utilization and scalability problems as instruction issue width and SIMD width grow.

This thesis focuses on performance-oriented instruction scheduling and covers three parts: instruction flow distribution, instruction fetch and dispatch, and instruction flow execution. First, it analyzes the relationship among the architecture parameters (SIMD width, VLIW length, and core count) and investigates how resource allocation affects system efficiency as the workload's characteristic values change, namely Thread-Level Parallelism (TLP), Instruction-Level Parallelism (ILP), and Data-Level Parallelism (DLP). This investigation helps explain the performance bottlenecks of the architecture and balance data path utilization against system scalability. Second, because the instruction fetch and dispatch efficiency of a variable-length VLIW processor can significantly affect data path utilization, the thesis studies the key problems of instruction fetch and issue in order to reduce the pipeline stalls they cause, greatly improving DSP performance. Third, for the SIMD technology widely used in high-performance DSPs, increasing SIMD width does not necessarily improve execution performance, because different algorithms place quite different demands on SIMD width and program flow control. Increasing SIMD utilization can therefore dramatically improve system performance.

This thesis studies the key techniques of instruction flow scheduling for super wide SIMD DSP, and its contributions are as follows:

(1) Building on recent analysis and research on the performance and power consumption of general-purpose multicore processors, this work builds a new parameterized performance and power analysis model to evaluate hierarchical on-chip large-scale parallel architectures. The model abstracts parameters such as the number of cores, super node dimensions, the number of processing units, and the number of function units, and examines the effect of resource allocation on system efficiency as the workload feature values (TLP, ILP, and DLP) change under given performance and power constraints. The analysis results guide reasonable structural choices for super-high-performance DSP design, provide a theoretical foundation for improving system scalability, and further reveal the performance bottlenecks of these structures.

(2) To improve the fetch and issue efficiency of variable-length VLIW processors, and to reduce or eliminate the drawbacks of current methods for improving single-thread instruction fetch and issue pipelines, this thesis proposes a highly efficient instruction fetch and issue framework based on variable-length VLIW. The framework introduces a mechanism that detects and discards invalid instructions to eliminate the overhead of fetching them; a mechanism that bypasses missing instructions to reduce the pipeline stalls they cause; and a variable-length instruction issue window that solves the problem of instruction separation, providing a highly efficient, continuous instruction flow to the architecture. This research further exposes the acceleration mechanism of the VLIW-based instruction fetch and issue pipeline and clarifies acceleration techniques for single-thread pipeline control, which is important for guiding efficient pipeline control design. This kind of fetch and issue pipeline can be applied to any VLIW-based processor.

(3) This thesis proposes the Divergent Branch Threads Compaction (DBTC) mechanism to solve the problem of idle SIMD resources caused by insufficient DLP in applications, which arises from small loop iteration counts, complex control flow, and divergent branch thread execution on SIMD architectures. Converting the parallel resources of SIMD hardware into actual application performance is the key to exploiting SIMD processors. Experimental results show that DBTC improves performance over the reference SIMD structure. The mechanism can be readily extended and applied to SIMD architectures.

(4) This thesis proposes Decoupled Iteration Mapping (DIM), a technique that dynamically maps loops with cross-iteration dependencies onto an improved SIMD architecture, exploiting the medium-granularity thread-level parallelism latent in the algorithms and achieving multicore-like thread-parallel performance. The scheme supports a hybrid form of execution in which loops with cross-iteration dependencies are statically identified and dynamically mapped onto the improved SIMD architecture. Each loop section is mapped to an individual PE when a SIMD lane is available. Inter-PE communication occurs only at section boundaries, across a dedicated data buffer chain (DBC). DIM keeps dependences thread-local, thus avoiding communication latency on the critical path. Experimental results show that DIM maintains the key advantage of the SIMD architecture and achieves speedup for applications with cross-iteration dependency loops.

(5) This work proposes the Hardware Support Software Pipeline (HSSP) mechanism to accelerate data-parallel loops with regular control flow, and on that basis proposes a multi-mode instruction-issuing scheme. The scheme combines DBTC for the control flow of irregular conditional branches, DIM for loops with loop-carried dependences, and HSSP for regular control flow. It systematically addresses the above problems for both irregular and regular applications and improves the overall capability of SIMD, breaking through key bottlenecks of SIMD structures.
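The abstract does not describe how DBTC is implemented, so the following is only a toy model of the underlying problem: why divergent branches leave SIMD lanes idle, and why regrouping lanes that follow the same path can raise utilization. All names (`baseline_slots`, `compacted_slots`, `utilization`) and the cost model are assumptions made for illustration: each branch path is taken to cost one issue slot per SIMD group, and the cost of moving data between lanes is ignored.

```python
from math import ceil


def baseline_slots(groups):
    """Issue slots when each SIMD group runs both branch paths serially under a mask."""
    slots = 0
    for mask in groups:
        taken = sum(mask)
        not_taken = len(mask) - taken
        # A path costs one slot only if at least one lane in the group follows it.
        slots += (taken > 0) + (not_taken > 0)
    return slots


def compacted_slots(groups, width):
    """Issue slots when lanes on the same path are regrouped across all groups."""
    taken = sum(sum(mask) for mask in groups)
    total = sum(len(mask) for mask in groups)
    not_taken = total - taken
    # Each path is packed into as few full-width groups as possible.
    return ceil(taken / width) + ceil(not_taken / width)


def utilization(active_lanes, slots, width):
    """Fraction of lane-slots that do useful work."""
    return active_lanes / (slots * width)


WIDTH = 4  # hypothetical SIMD width
# Per-lane branch outcomes for four SIMD groups (True = branch taken).
groups = [
    [True, False, True, False],
    [False, True, False, False],
    [True, True, False, True],
    [False, False, True, False],
]

active = sum(len(mask) for mask in groups)  # every lane executes exactly one path
base = baseline_slots(groups)
comp = compacted_slots(groups, WIDTH)

print(f"baseline:  {base} slots, utilization {utilization(active, base, WIDTH):.2f}")
print(f"compacted: {comp} slots, utilization {utilization(active, comp, WIDTH):.2f}")
```

With these example masks, the masked baseline needs 8 issue slots (utilization 0.50), while regrouping by path needs 5 (utilization 0.80). A real mechanism would also have to account for the data movement that regrouping requires, which this sketch leaves out.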

Keywords/Search Tags:

Single Instruction stream Multiple Data streams (SIMD), Very Long Instruction Word (VLIW), Multi-mode Instruction Issuing, Variable-length Instruction, Loop-carried Dependency, Conditional Vector Branch, Software Pipeline

Related items

1	The Orchestration Of Instruction Issuing In Data Parallel Processors
2	Design And Implementation Of GCC Instruction Scheduling Algorithm Based On TMS320C6000
3	Research And Implementation Of FT64-2 Kernel Assembler
4	Exploiting multi-grained parallelism for multiple-instruction-stream architectures
5	The Design And Implement Of Instruction Decode&Control Unit In FT-C55LP
6	Design And Implementation Of The Instruction Fetch Unit And Multiple Instruction Flows Extension In The YHFT-Matrix DSP
7	Optimization And Design Of Instruction Pipeline Of YHFT-DX High Performance DSP
8	Studies On CRS Crossbar Based Single-Instruction Multiple-Data Stream Computing Architectures
9	Research And Design Of Unified Shader With Automatic Scheduling Of Threads And VLIW
10	Research On The Key Techniques Of Application-Specific Instruction-Set Processors