Research On Clustered Superscalar Processor

Posted on: 2010-09-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Yang
Full Text: PDF
GTID: 1268330392472527
Subject: Microelectronics and Solid State Electronics
Abstract/Summary:
VLSI design faces new problems as digital technology and semiconductor fabrication advance. In processor design, these problems include wire and memory-access latency, power efficiency, design complexity and cost, and an 80-percent average increase in computation demand. Traditional microarchitectures, which rely on large monolithic units and therefore scale poorly, can barely meet current technical and economic requirements, so a series of new architectures has been proposed, among them the clustered superscalar processor. A clustered superscalar processor addresses these problems by replacing monolithic units with small cooperating units (clusters) and focusing on how the clusters cooperate. As a result, it is more scalable and can achieve comparable or even higher performance.

This dissertation first examines steering policy in depth, since it plays a critical role in the performance of a clustered superscalar processor. We describe and analyze several classical steering policies, including the effects of clustering granularity, instruction queue (IQ) size, inter-cluster communication latency, the number of IQ add ports, and fault rules. Our results show that the performance of a clustered superscalar processor under different steering policies varies with the simulation conditions. We observe that the classical dependence-based steering policy can still be improved with respect to load balancing, and that a dependence-based policy does not need as many add ports as the fetch width. Based on these observations, we propose a new steering policy, the LA policy, which balances the workload by constraining the number of add ports for a particular IQ size. The LA policy performs load balancing in several smaller steps to achieve the same effect as the accumulated one-time load balancing of the DCOUNT policy.
Compared with DCOUNT, the LA policy reduces hardware complexity and area cost while maintaining comparable performance.

To identify the performance bottlenecks of the clustered superscalar processor, we then propose an online critical-path analysis framework that is more accurate, faster, and more efficient. It is based on an improved version of the dependence-graph model originally proposed by Fields. We integrate the framework into the simulator and use it in our research, including CPI breakdowns and critical-path analysis.

Next, our research turns to microarchitecture. We propose a clustered superscalar microarchitecture based on a thorough understanding of the pros and cons of the Partitioned Register File (PRF) and the copy-instruction mechanism. The partitioned renaming mechanism, with clustered renaming stages, reduces hardware cost and improves scalability. The copy-instruction mechanism, which our results show is critical to instruction scheduling and execution, is implemented by adding extra bits to the IQ. Compared with a processor that uses a separate copy-instruction queue, a 1x8 clustered processor with this IQ structure performs 2.5% better on integer programs and shows little change on floating-point programs. With respect to resource contention between copy instructions and other instructions, performance increases by 26.2% on integer programs and 59.7% on floating-point programs.

Designers of clustered superscalar processors have begun to use point-to-point (P2P) networks for inter-cluster connection, since wire latency contributes more to on-chip latency in deep sub-micron technologies. The performance of a clustered superscalar processor is very sensitive to inter-cluster communication latency, so an optimized P2P network is in great demand.
We therefore first extract and analyze the communication characteristics of P2P networks in clustered superscalar processors, identifying their low workload and relatively balanced traffic distribution. Based on these characteristics, we simulate and evaluate the performance of P2P networks on a configurable network platform and then settle on an optimization plan for network and router design. From these results, we propose a P2P network with a specialized router and network structure that can be applied to clustered superscalar processors. The proposed network consists of a pair of networks that share one dedicated control logic block and can therefore transmit tags and data separately. Flow-control logic is removed from the routers, and bypass mechanisms are implemented. Processors using our P2P network show a 5.8% performance increase over those using SynNet with an ideal bus.

Finally, to address the inherent but important problem of cache access latency, we reduce the latency of subordinate cache access by giving each cluster a private speculative L0 cache. We add access updating to the L0 cache and fix the access bandwidth. Compared with other research works, our L0 cache achieves a 44.8% increase in access hit rate while still guaranteeing correctness. Simulation results show that an 8-cluster processor with a 4KB, 2-way set-associative clustered L0 cache improves its performance by 5.6% on average. For some specific programs the increase reaches 20%, which is 3.1% higher than in other research works.
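To make the steering discussion in the abstract more concrete, the following is a minimal Python sketch of dependence-based steering in which each cluster's IQ accepts only a limited number of new instructions per cycle (its add ports). It is an illustration under stated assumptions, not the dissertation's LA policy: the names `Cluster` and `steer_group`, the fallback to the least-loaded cluster, and the stall rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    iq_size: int               # instruction-queue capacity
    add_ports: int             # max instructions the IQ accepts per cycle
    occupancy: int = 0         # IQ entries currently in use
    added_this_cycle: int = 0  # add ports already used this cycle

    def can_accept(self) -> bool:
        return (self.occupancy < self.iq_size
                and self.added_this_cycle < self.add_ports)


def steer_group(group, clusters, producer_cluster):
    """Steer one fetch group in program order.

    group: list of (dest_reg, [src_regs]) tuples.
    producer_cluster: dict register -> cluster id, updated in place.
    Returns one cluster id per instruction (None means it stalled).
    """
    for c in clusters:
        c.added_this_cycle = 0  # add ports are a per-cycle resource

    placement = []
    for dest, srcs in group:
        # Dependence-based preference: clusters that produce a source operand.
        preferred = [producer_cluster[s] for s in srcs if s in producer_cluster]
        candidates = [i for i in preferred if clusters[i].can_accept()]
        if not candidates:
            # Assumed fallback: any cluster with a free add port and IQ entry.
            candidates = [i for i, c in enumerate(clusters) if c.can_accept()]
        if not candidates:
            # Assumed stall rule: this and all younger instructions wait.
            placement.extend([None] * (len(group) - len(placement)))
            break
        target = min(candidates, key=lambda i: clusters[i].occupancy)
        clusters[target].occupancy += 1
        clusters[target].added_this_cycle += 1
        producer_cluster[dest] = target
        placement.append(target)
    return placement


# A four-instruction dependence chain on two clusters.
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
narrow = steer_group(chain, [Cluster(8, 2), Cluster(8, 2)], {})
wide = steer_group(chain, [Cluster(8, 4), Cluster(8, 4)], {})
print(narrow)  # [0, 0, 1, 1]
print(wide)    # [0, 0, 0, 0]
```

In this toy setting, giving each IQ as many add ports as the fetch width lets the whole chain pile into one cluster, while two add ports force the chain to spill into the second cluster. That is one way to read the abstract's remark that constraining add ports can serve as incremental load balancing; the real policy and its parameters are defined in the dissertation itself.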
Keywords/Search Tags: Clustered Superscalar Processor, Critical Path Breakdown, Microarchitecture, Instruction Steering Policy, Scalar Network, L0 Cache