A system-level approach to fault and variation resilience in multi-core die

Posted on: 2010-02-06

Degree: Ph.D.

Type: Dissertation

University: University of California, Berkeley

Candidate: Markovskiy, Yury

GTID: 1448390002486099

Subject: Computer Science

Abstract/Summary:

With shrinking transistors and growing parametric variability, statically managing die yield is no longer possible. Design for Manufacturing (DFM) techniques rely on ever larger guard-bands that waste area, power, and performance, impeding Moore's Law of semiconductor device scaling. Process, Voltage, and Temperature (PVT) variations can turn a nominally homogeneous many-core die into a set of cores with heterogeneous performance.

A Network-on-Chip (NoC) provides an effective and scalable way to integrate hundreds of heterogeneous cores without forcing each to give up its own PVT-induced operating point for the chip-wide common worst case. As with asynchronous logic, a NoC of regular, redundant cores in multiple clock (CLK) and supply-voltage (VDD) domains can deliver average-case rather than worst-case system performance, with greater power efficiency and fault tolerance than its globally synchronous monolithic counterparts [41, 92]. This work shows that Voltage-Frequency Island (VFI) architectures are also the key to tolerating and compensating for PVT variations.

The VFI advantages cannot be realized without run-time task-to-core mapping and adaptive network routing that optimally match application resource requirements with heterogeneous cores and communication fabric. These systematic techniques are more effective at mitigating a variety of faults and variations than layout and circuit DFM. Most importantly, the gains from these techniques can be translated into die yield improvements and smaller DFM guard-bands.

This work investigates core sparing and network routing. The developed models demonstrate that core sparing reduces the asymptotic die cost from O(A^3) to O(A^(1/2)), and that it is more cost-efficient than the larger design guard-bands of layout and circuit redundancy. For a fixed die area, the analysis favors a greater number of smaller, unreliable cores over fewer, larger, reliable cores. This points to the limitations, and ultimately the futility, of DFM techniques in future semiconductor process generations.

Adaptive network routing enables core sparing. More critically, it simultaneously combats the two sources of network load imbalance: on-die performance heterogeneity from PVT variations and application communication topology. With stochastic PVT variations, the developed Minimal Adaptive Total Congestion (MATC) router increases the expected network saturation bandwidth by 7 to 23% and reduces its variance by a factor of 2 to 10 compared with the Dimension Order router. With systematic PVT variations, the improvements are 5 to 35%. These gains of the adaptive router can compensate for degradation due to performance variations and can thus be used to reduce design guard-bands.

By treating cores as units of fault and variation tolerance, these systematic techniques provide a simple and consistent way to deal with static and dynamic performance variations and faults. They are more effective than isolated DFM solutions. Rather than fighting and minimizing on-die parametric variations, our approach takes advantage of platform heterogeneity and manages its net impact on system performance.
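
The intuition behind core sparing can be illustrated with a small numerical sketch. It is not the dissertation's cost model: it assumes a textbook Poisson defect model for per-core yield and independent core failures, and it ignores the area overhead of spares and of the network. The function names and the example die area and defect density are hypothetical, chosen only for illustration.

```python
import math


def core_yield(core_area_cm2, defect_density_per_cm2):
    """Assumed Poisson model: probability that one core has no fatal defect."""
    return math.exp(-core_area_cm2 * defect_density_per_cm2)


def die_yield_with_spares(total_cores, required_cores, p_core):
    """Probability that at least `required_cores` of `total_cores` cores work.

    Assumes cores fail independently, so the number of working cores
    follows a binomial distribution.
    """
    return sum(
        math.comb(total_cores, k) * p_core**k * (1.0 - p_core) ** (total_cores - k)
        for k in range(required_cores, total_cores + 1)
    )


if __name__ == "__main__":
    die_area_cm2 = 4.0   # hypothetical fixed die area
    defect_density = 0.5  # hypothetical defects per cm^2

    # (total cores on the die, cores that may be spared)
    for total, spares in [(1, 0), (16, 0), (16, 2), (64, 8)]:
        p = core_yield(die_area_cm2 / total, defect_density)
        y = die_yield_with_spares(total, total - spares, p)
        print(f"{total:3d} cores, {spares} spares: die yield = {y:.3f}")
```

Under these assumptions, splitting the die into more cores does not help by itself (without spares, the die yield stays at the monolithic value), but allowing a few cores to be spared raises the yield substantially. The sketch does not attempt to reproduce the asymptotic cost result quoted in the abstract.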

Keywords/Search Tags:

DFM, PVT variations, Performance, Fault, Core

Related items

1	Accurate and efficient assessment of the impact of interconnect variations on CMOS IC timing performance
2	Performance Analysis And Enhancement Of Virtualization On Multi-core Processors
3	Technology impacts of CMOS scaling on microprocessor core design for hard-fault tolerance in single-core applications and optimized throughput in throughput-oriented chip multiprocessors
4	Mitigating the cost, performance, and power overheads induced by load variations in multicore cloud servers
5	Research On Self-adaptive Fault-tolerant Techniques For Many Core Processors
6	Research Of Physical Unclonable Function Based On Measuring Power Distribution System Resistance Variations
7	Parallel Fault Simulation On Multi-core CPU
8	Performance Assessment And Fault Detection Of Control System
9	Performance Fault Detection And Avoidance For Parallel Programs
10	Algorithme d'adaptation du filtre de Kalman aux variations soudaines de bruit