Research On Clustered Superscalar Processor

Posted on:2010-09-09

Degree:Doctor

Type:Dissertation

Country:China

Candidate:B Yang

Full Text:PDF

GTID:1268330392472527

Subject:Microelectronics and Solid State Electronics

Abstract/Summary:

VLSI design faces new problems as digital technology and semiconductor fabrication develop. In processor design, these problems are wire and memory-access latency, power efficiency, design complexity and cost, and an 80-percent average increase in computation demand. Traditional microarchitectures, which rely on large monolithic units and therefore scale poorly, can barely satisfy the requirements of current technical and economic development, so a series of new architectures has been proposed, among them the clustered superscalar processor. The clustered superscalar processor addresses the problems above by replacing monolithic units with small cooperating units (clusters) and paying more attention to their cooperation. As a result, it is more scalable and can achieve comparable or even higher performance.

This dissertation first examines steering policy in depth, as it plays a critical role in the performance of a clustered superscalar processor. We describe and analyze several classical steering policies, including the effects of clustering granularity, instruction queue (IQ) size, inter-cluster communication latency, the number of IQ add ports, and fault rules. Our results show that the performance of a clustered superscalar processor under different steering policies varies with the simulation conditions. We observe that the classical dependence-based steering policy can still be improved with respect to load balancing, and that a dependence-based policy does not need as many add ports as the fetch width. Based on these observations, we propose a new steering policy, the LA policy, which balances the workload by constraining the number of add ports for a particular IQ size. The LA policy performs load balancing several times to achieve the same effect as the accumulated one-time load balancing of the DCOUNT policy. Compared with DCOUNT, the LA policy reduces hardware complexity and area cost while maintaining comparable performance.

To identify the performance bottlenecks of the clustered superscalar processor, we then propose an online critical-path analysis framework that is more accurate, faster, and more efficient. It is based on an improved version of the dependence-graph model originally proposed by Fields. We integrate the framework with the simulator and use it in our research, including CPI breakdowns and critical-path analysis.

Afterwards, our research focuses on microarchitecture. We propose a clustered superscalar microarchitecture based on a full understanding of the pros and cons of the Partitioned Register File (PRF) and the copy instruction mechanism. The partitioned renaming mechanism with clustered renaming stages reduces hardware cost and improves scalability. The copy instruction mechanism, which our results show is critical to instruction scheduling and execution, is implemented by adding extra bits to the IQ. Compared with a processor that uses a separate copy instruction queue, a 1x8 clustered processor with this IQ structure performs 2.5% better on integer programs and about the same on floating-point programs. As for resource contention between copy instructions and other instructions, performance increases by 26.2% on integer programs and 59.7% on floating-point programs.

Designers of clustered superscalar processors have begun to use point-to-point (P2P) networks for inter-cluster connection, since wire latency contributes more to on-chip latency in deep sub-micron technologies. The performance of a clustered superscalar processor is very sensitive to inter-cluster communication latency, so an optimized P2P network is in great demand. We therefore first extract and analyze the communication characteristics of P2P networks in clustered superscalar processors, identifying a low workload and a relatively balanced traffic distribution. Based on these characteristics, we simulate and evaluate the performance of P2P networks on a configurable network platform and then decide on an optimization plan for network and router design. From these results, we propose a P2P network with a specialized router and network structure that can be applied to clustered superscalar processors. The proposed network consists of a pair of networks that share one control logic block and can thus transmit tags and data separately. Flow-control logic is removed from the routers, and bypass mechanisms are implemented. Processors using our P2P network perform 5.8% better than those using SynNet with an ideal bus.

Finally, for the inherent but important problem of cache access latency, we reduce the latency of subordinate cache accesses by giving each cluster a private speculative L0 Cache. We add access updating to the L0 Cache and fix the bandwidth for access. Compared with other research, our L0 Cache achieves a 44.8% higher access hit rate while still guaranteeing correctness. Simulation results show that an 8-cluster processor with a 4KB 2-way set-associative clustered L0 Cache improves performance by 5.6% on average. For some specific programs the improvement reaches 20%, which is 3.1% higher than in other research.
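
To make the steering discussion in the abstract concrete, the following is a minimal sketch of a generic dependence-based steering rule with a simple load-imbalance override. It is not the DCOUNT or LA policy as specified in the dissertation. The function name `steer`, the `imbalance_threshold` parameter, the `(dest, srcs)` instruction encoding, and the load model (a count of steered instructions that never drains) are all assumptions made for illustration.

```python
def steer(instrs, n_clusters, imbalance_threshold=4):
    """Assign each instruction to a cluster (illustrative sketch only).

    instrs: iterable of (dest, srcs) pairs, where dest is a register name
            or None and srcs is a tuple of source register names.
    Returns a list with the chosen cluster index for each instruction.
    """
    load = [0] * n_clusters   # instructions steered so far (assumed load metric)
    producer = {}             # register -> cluster holding its latest value
    placement = []

    for dest, srcs in instrs:
        # Count how many source operands are produced in each cluster.
        votes = [0] * n_clusters
        for reg in srcs:
            if reg in producer:
                votes[producer[reg]] += 1

        least = min(range(n_clusters), key=lambda c: load[c])
        # Prefer the cluster with the most operands; break ties by lower load.
        preferred = max(range(n_clusters), key=lambda c: (votes[c], -load[c]))

        # Follow the dependence unless there is none, or unless the preferred
        # cluster is too far ahead of the least-loaded one.
        if sum(votes) == 0 or load[preferred] - load[least] > imbalance_threshold:
            target = least
        else:
            target = preferred

        load[target] += 1
        if dest is not None:
            producer[dest] = target
        placement.append(target)

    return placement
```

In this sketch a chain of dependent instructions stays in one cluster, avoiding inter-cluster communication, until the imbalance exceeds the threshold. A lower threshold trades more communication for a more even load, which is the tension the steering policies in the abstract aim to manage.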

Keywords/Search Tags:

Clustered Superscalar Processor, Critical Path Breakdown, Microarchitecture, Instruction Steering Policy, Scalar Network, L0 Cache

Related items

1	Microarchitecture for billion-transistor VLSI superscalar processors
2	Design Of Alpha-Based Clustered Superscalar Processor IU
3	Research For Micro-Architecture Optimization On Media DSP IP Core
4	Research On The Heterogeneous Media Dual-Issue Processor Design
5	Microarchitecture and compilation support for clustered instruction-level parallel processors
6	Research On Microarchitecture Of Media Digital Signal Processor MediaDSP6410
7	Research And Design Of High Performance Digital Signal Processor
8	Quantitative Analysis On The Impact Of Memory Access Behavior By Instruction Dynamic Scheduling
9	Research And Design Of Superscalar Microprocessor Based On RISC-V Instruction Architecture
10	Research On Energy Efficiency Optimization Of Superscalar Processor