A system-level approach to fault and variation resilience in multi-core die

Posted on: 2010-02-06
Degree: Ph.D.
Type: Dissertation
University: University of California, Berkeley
Candidate: Markovskiy, Yury
Full Text: PDF
GTID: 1448390002486099
Subject: Computer Science
Abstract/Summary:
With shrinking transistors and growing parametric variability, statically managing die yield is no longer possible. Design for Manufacturing (DFM) techniques rely on ever larger guard-bands that waste area, power, and performance, impeding Moore's Law of semiconductor device scaling. Process, voltage, and temperature (PVT) variations can turn a nominally homogeneous many-core die into a set of cores with heterogeneous performance.

A Network-on-Chip (NoC) provides an effective and scalable way to integrate hundreds of heterogeneous cores without forcing each core to give up its own PVT-induced operating point for the chip-wide common worst case. As with asynchronous logic, a NoC of regular, redundant cores, each with its own clock and supply voltage (CLK/VDD), can deliver average-case rather than worst-case system performance, with greater power efficiency and fault tolerance than globally synchronous monolithic counterparts [41, 92]. This work shows that Voltage-Frequency Island (VFI) architectures are also the key to tolerating and compensating for PVT variations.

The VFI advantages cannot be realized without run-time task-to-core mapping and adaptive network routing that optimally match application resource requirements to the heterogeneous cores and communication fabric. These system-level techniques are more effective at mitigating a variety of faults and variations than layout-level and circuit-level DFM. Most importantly, the gains from these techniques can be translated into die yield improvements and smaller DFM guard-bands.

This work investigates core sparing and network routing. The developed models demonstrate that core sparing reduces the die cost asymptotically from $O(A^3)$ to $O(A^{1/2})$, and that it is more cost-efficient than the larger design guard-bands of layout and circuit redundancy. Given a fixed die area, the analysis favors a greater number of smaller, unreliable cores over fewer, larger, reliable cores. This points to the limitations, and ultimately the futility, of DFM techniques in future semiconductor process generations.

Adaptive network routing enables core sparing. More critically, it simultaneously combats the two sources of network load imbalance: on-die performance heterogeneity caused by PVT variations, and the application's communication topology. With stochastic PVT variations, the developed Minimal Adaptive Total Congestion (MATC) router increases the expected network saturation bandwidth by 7-23% and reduces its variance by 2-10x compared to the Dimension Order router. With systematic PVT variations, the improvements are 5-35%. These gains can compensate for degradation due to performance variations and can thus be used to reduce design guard-bands.

By treating cores as the units of fault and variation tolerance, these system-level techniques provide a simple and consistent way to deal with static and dynamic performance variations and faults. They are more effective than isolated DFM solutions. Rather than fighting and minimizing on-die parametric variations, our approach takes advantage of platform heterogeneity and manages its net impact on system performance.
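
The core-sparing argument in the abstract can be illustrated with a small numerical sketch. It is not the dissertation's cost model: it assumes a textbook Poisson defect model for per-core yield and a binomial count of working cores, and the defect density, die area, and core counts are hypothetical values chosen only for illustration.

```python
from math import comb, exp


def core_yield(defect_density, core_area):
    """Assumed Poisson model: probability that a core has no killer defect."""
    return exp(-defect_density * core_area)


def die_yield(n_cores, n_required, p_core):
    """Probability that at least n_required of n_cores are working,
    assuming independent core failures (binomial tail)."""
    return sum(
        comb(n_cores, k) * p_core**k * (1 - p_core) ** (n_cores - k)
        for k in range(n_required, n_cores + 1)
    )


# Hypothetical parameters, not taken from the dissertation.
DEFECT_DENSITY = 0.5  # killer defects per cm^2
DIE_AREA = 4.0        # cm^2

# Few large cores, no spares: every core must work.
few_large = die_yield(4, 4, core_yield(DEFECT_DENSITY, DIE_AREA / 4))

# Many small cores with sparing: any 48 of the 64 cores suffice.
many_small = die_yield(64, 48, core_yield(DEFECT_DENSITY, DIE_AREA / 64))

print(f"4 large cores, no spares:    {few_large:.4f}")
print(f"64 small cores, 16 spares:   {many_small:.4f}")
```

Under these assumptions, a die that needs every core to work has the same yield however finely it is partitioned, because the total defect-free area is unchanged. The yield gain comes entirely from tolerating a few failed cores, at the price of reserving some area as spares.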
Keywords/Search Tags:DFM, PVT variations, Performance, Fault, Core