
Highly efficient multithreaded architecture

Posted on: 2006-11-20
Degree: Ph.D
Type: Thesis
University: University of Rochester
Candidate: El-Moursy, Ali A
Full Text: PDF
GTID: 2458390008953426
Subject: Engineering
Abstract/Summary:
The performance and power optimization of dynamic superscalar microprocessors requires striking a careful balance between exploiting parallelism and simplifying hardware. Hardware structures that are needlessly complex may exacerbate critical timing paths and dissipate extra power. In a Simultaneous Multi-Threaded (SMT) processor, simplifying hardware structures is particularly challenging because multi-threading increases their utilization.

In this thesis, we take three approaches to increasing the efficiency of SMT processors. First, we propose new front-end policies that reduce the required integer and floating-point issue queue sizes in a monolithic SMT processor. We explore both general policies and policies directed towards alleviating a particular cause of issue queue inefficiency. For the same level of performance, the most effective policies reduce issue queue occupancy by 33% in an SMT processor with appropriately sized issue queue resources. Second, we examine processor partitioning options for a large number of on-chip threads. While growing transistor budgets permit four- and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, and the L1 Dcache banks in a Clustered Multi-Threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads, while sharing the front-end resources extensively. Finally, we devise thread co-scheduling policies for a Chip Multi-Processor (CMP) of dual-threaded processors. Our techniques, which operate at time-slice granularity, use ready- and in-flight-instruction metrics to co-schedule compatible threads. We achieve a 20% performance improvement over the average of the possible arbitrary thread groupings, and about 30% better performance than the worst arbitrary grouping of threads.
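To make the co-scheduling idea concrete, the sketch below pairs threads onto dual-threaded cores using per-time-slice samples of ready and in-flight instruction counts. The ThreadSample structure, the combined demand score, and the high-with-low pairing rule are hypothetical choices for this illustration, not the exact policy evaluated in the thesis.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct ThreadSample {
    std::string name;   // thread identifier
    double avgReady;    // average ready instructions observed in the last time slice
    double avgInFlight; // average in-flight instructions observed in the last time slice
};

// Combine the two metrics into one resource-demand score
// (equal weighting is an arbitrary choice made only for this sketch).
static double demand(const ThreadSample& t) {
    return t.avgReady + t.avgInFlight;
}

// At each time slice, pair the most demanding thread with the least
// demanding one, the second-most with the second-least, and so on, so
// that each dual-threaded core sees a roughly balanced load.
std::vector<std::pair<ThreadSample, ThreadSample>>
coSchedule(std::vector<ThreadSample> threads) {
    std::vector<std::pair<ThreadSample, ThreadSample>> pairs;
    if (threads.size() < 2) return pairs;
    std::sort(threads.begin(), threads.end(),
              [](const ThreadSample& a, const ThreadSample& b) {
                  return demand(a) < demand(b);
              });
    for (size_t lo = 0, hi = threads.size() - 1; lo < hi; ++lo, --hi) {
        pairs.emplace_back(threads[lo], threads[hi]);
    }
    return pairs;
}

Pairing a high-demand thread with a low-demand one is just one plausible notion of "compatible" threads; the thesis evaluates its specific metrics against arbitrary groupings, as summarized above.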
Keywords/Search Tags:Performance, Processor, Issue queue, SMT, Threads