Research On Key Techniques Of Superscalar Embedded Processor Design

Posted on: 2010-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Meng
GTID: 1118360302983169
Subject: Circuits and Systems

Abstract/Summary:
With the growth of embedded applications, high-performance, low-power embedded processors are becoming essential. Superscalar design is an architecture technique for exploiting instruction-level parallelism: a superscalar processor executes more than one instruction per clock cycle by dispatching multiple instructions simultaneously to redundant functional units. It has become a main trend in high-end embedded processor design. This thesis analyzes high-performance and low-power techniques for implementing a superscalar embedded processor. Its original contributions are as follows:

1. Zero-delay branch prediction and low-power branch folding. A branch prediction mechanism is proposed that avoids global branch history aliasing and pipeline stalls. It accesses the branch history table (BHT) with the prediction history register (PHR), which improves prediction accuracy. Based on the temporal and spatial characteristics of instruction fetching in loops, an instruction-recycling-based low-power branch folding mechanism is introduced. The instruction recycling buffer reuses the hardware of the instruction buffer: loop instructions that have already passed through are frozen in the instruction buffer and pushed directly into the pipeline when branch folding occurs. This reduces the branch penalty and eliminates instruction cache accesses during branch folding.

2. Hardware-based speculative execution with non-blocking issue and fast retirement. A non-blocking instruction issue mechanism with dynamic reservation distribution is proposed to reduce the impact of data dependencies. Out-of-order execution with speculation tagging resolves stalls caused by control dependencies and improves instruction-level parallelism during execution. It can also correct a branch misprediction before the branch instruction retires, minimizing the misprediction penalty in the superscalar pipeline. Fast retirement lets the sub-pipelines of the execution units write back directly and removes the communication delay between long-latency execution units and the pipeline retirement logic.

3. High-performance and low-power L1 on-chip memory. A flexible memory hierarchy and a low-power access mechanism are critical for a superscalar embedded processor. A low-power instruction cache access technique is proposed that exploits the sequential behavior of instruction fetching: during sequential fetches and backward branches, it stops accessing the tag array of the current way and all memory banks of the other ways. A non-blocking data cache is proposed to remove the interlock between data cache access requests. Scratchpad memory (SPM) is also analyzed; it supports "local memory" and "cache" modes. The SPM architecture introduces task-level parallelism between the processor and DMA. A hardware stack technique, built on the SPM extension interface, provides seamless context switching between programs.

4. General coprocessor extension techniques. Coprocessors are a primary means of application extension in embedded processors. This thesis analyzes a general coprocessor (GCP) extension mechanism. First, a GCP instruction set is proposed that bridges the base instruction set to a dedicated coprocessor instruction set. A GCP instruction subset makes coprocessor extension possible in a 16-bit instruction set architecture. The coprocessor interface supports both synchronous and asynchronous communication. An inexact exception execution mode improves instruction-level parallelism between base instructions and coprocessor instructions. A configurable interrupt mechanism is also added to the GCP to make communication efficient.

5. RTL-level Observability Don't Care (ODC) algorithm for low-power data path optimization. Dynamic power in the data path is the principal source of power dissipation in an embedded processor. This thesis analyzes clock gating based on the ODC algorithm. The Bus-ODC model extracts the ODC conditions of RTL-level signals. The Path-ODC model cuts the data path into several short paths and computes the ODC condition of each separately. Together, the two models reduce the ODC computation load and improve computation efficiency. The probability of an ODC condition is also proposed as an important basis for clock gating logic synthesis: clock gating logic is preferentially inserted into data paths whose ODC condition has a high probability. Probability-driven clock gating synthesis improves the efficiency of the clock gating network with very little hardware overhead.

6. Object-oriented cycle-accurate processor model and fast processor simulation model for SoC verification. Pipeline functions are classified into two categories: architecture models and behavior models. Architecture models simulate pipeline scheduling, and behavior models simulate architecture-independent functions. A cycle-accurate processor model can be reconfigured quickly by scheduling behavior models within architecture models, which helps design space exploration of superscalar pipeline implementations. This thesis also provides a processor modeling method for accelerating SoC verification. Temporal Redundant Compression detects idle states of the system bus and skips all redundant simulation during those periods. Spatial Redundant Compression dynamically monitors the addresses of data operations and makes the processor model access internal memory when the target address falls within the local memory region.

The techniques proposed in this thesis facilitate the implementation of superscalar embedded processors and have positive effects on performance, power, and extensibility.
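
The abstract does not spell out how the PHR-indexed BHT of contribution 1 is organized, so the following is only a generic, textbook-style sketch of the underlying idea: a history register selects one of several 2-bit saturating counters, and the selected counter gives the prediction. The class name `HistoryIndexedPredictor`, the helper `run`, the 4-bit history length, and the counter initialization are all assumptions made for illustration and are not taken from the thesis.

```python
class HistoryIndexedPredictor:
    """Illustrative predictor: a history register indexes 2-bit counters.

    This is a generic sketch, not the mechanism proposed in the thesis.
    """

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = 0                      # recent outcomes, newest in bit 0
        self.bht = [1] * (1 << history_bits)  # 2-bit counters, weakly not-taken

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.bht[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history, then shift
        # the actual outcome into the history register.
        counter = self.bht[self.history]
        if taken:
            self.bht[self.history] = min(3, counter + 1)
        else:
            self.bht[self.history] = max(0, counter - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(pattern, repeats, history_bits=4):
    """Return the fraction of correct predictions on a repeated pattern."""
    predictor = HistoryIndexedPredictor(history_bits)
    correct = 0
    total = 0
    for _ in range(repeats):
        for taken in pattern:
            if predictor.predict() == taken:
                correct += 1
            predictor.update(taken)
            total += 1
    return correct / total


if __name__ == "__main__":
    # A loop branch taken three times, then not taken at loop exit.
    loop_branch = [True, True, True, False]
    print(run(loop_branch, repeats=100, history_bits=4))
    print(run(loop_branch, repeats=100, history_bits=1))
```

In this sketch, a history long enough to cover the loop body lets the predictor learn the loop exit, whereas a 1-bit history keeps mispredicting it. A real design would also have to handle speculative history updates and recovery after a misprediction, which are omitted here.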
Keywords/Search Tags: Superscalar Embedded Processor, Branch Prediction, Branch Folding, Speculation Execution, On-chip Memory, Coprocessor Extension, Clock Gating, Cycle Accurate Model