
Thermal Reliability Optimization For Optical Network-on-Chip Based Many-Core Systems

Posted on: 2021-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Q Li
Full Text: PDF
GTID: 1488306464457974
Subject: Computer Science and Technology
Abstract/Summary:
Moore's Law, coupled with Dennard scaling, has brought exponential performance improvement over the past five decades. With the development of nanotechnology and rapid progress in IC manufacturing techniques, CMOS transistors have shrunk continuously, and the number of transistors integrated on a single chip has increased exponentially. Many-core systems have become the leading design framework for VLSI and embedded systems. As an emerging communication architecture for new-generation many-core systems, the optical network-on-chip (ONoC) offers ultra-high bandwidth, low latency, and low power dissipation for inter-processor communication. ONoC-based many-core systems therefore achieve powerful parallel processing capability, efficient computing and communication performance, excellent on-chip resource utilization, and good scalability. These features have made them widely used in high-performance computing and supercomputing systems.

However, due to the rapid growth of power density and the limited advancement of heat dissipation techniques, many-core systems suffer from severe overheating. To stay within the allowable power budget and safe temperature limits, silicon chips cannot be fully utilized. A large portion of the processor cores must be 'dark' or 'dim', i.e., either powered off or under-clocked, which gives rise to the so-called 'Dark Silicon' phenomenon. In the dark silicon era, thermal reliability becomes a critical challenge for ONoC-based many-core systems. On the one hand, to adhere to the safe chip temperature constraint, a utilization wall significantly limits the computing performance of many-core processors. On the other hand, because ONoCs are intrinsically susceptible to temperature, the core optical devices used for inter-processor communication suffer significant thermally induced optical power loss under on-chip temperature variations, which threatens the communication reliability of ONoCs. Research on thermal reliability optimization is therefore urgent for ONoC-based many-core systems.

In this thesis, we propose novel hardware/software co-design methodologies and techniques to co-optimize the thermal reliability, system performance, and energy efficiency of ONoC-based many-core systems. These research outputs provide technical support for the design of next-generation high-performance, high-reliability many-core systems. The main contributions of this thesis are as follows:

(1) Temperature prediction and optimization for dark silicon many-core processors. An analytical thermal model for accurate yet quick processor temperature prediction is proposed. The model treats heat transfer in many-core processors as an equivalent first-order RC circuit, based on the duality between heat transfer and electrical phenomena, and adds two empirical scaling factors; it achieves high prediction accuracy and running efficiency. On this basis, a system-level task mapping technique for chip temperature optimization is put forward, comprising a mixed-integer linear programming (MILP) model and a greedy heuristic algorithm called Temperature-Constrained Task Selection (TCTS). The MILP model obtains the optimal task-to-core assignment with the minimum chip peak temperature. The TCTS algorithm maximizes system performance within the safe chip temperature range.

(2) Two new thermal monitoring schemes for ONoCs, developed through hardware/software co-design, which lay a solid foundation for the thermal reliability optimization of ONoCs.

The centralized thermal monitoring scheme: By utilizing the intrinsic thermal sensitivity of existing micro-ring resonators (MRs) and the inter-processor communications in ONoCs, this scheme implements accurate and low-cost centralized temperature monitoring for ONoCs while requiring hardly any additional hardware support. First, the thermal sensitivity of MRs is quantitatively analyzed and systematically modeled, which serves as the theoretical foundation of the proposed thermal monitoring schemes. A basic thermal sensing (BTS) module is then developed by leveraging the idle injection and ejection ports of individual optical routers. Because inter-processor communications often occupy the injection or ejection ports of routers at runtime, a collaborative thermal sensing (CTS) approach is further proposed for ONoC temperature estimation, combining the BTS module with a lightweight software solution. The centralized scheme enables global synchronous thermal monitoring for ONoCs. However, its scalability is limited: as the scale of the network increases, the complexity of centralized management grows exponentially. In contrast, distributed thermal monitoring schemes have better flexibility and scalability.

The distributed thermal monitoring scheme: First, a novel process variation (PV)-tolerant optical temperature sensor, PV-OTS, is designed based on cascaded MRs. By exploiting the intrinsic thermal sensitivity of MRs and the hidden 'redundancy' in WDM technology, this design achieves accurate and efficient temperature measurement with strong robustness to PVs. On this basis, a lightweight implementation scheme, Arb Link, is put forward for ONoCs, in which the sensor design is tailored to fit into the ONoC architecture so that optical routers can be reused for on-chip thermal monitoring with trivial hardware overhead. The proposed sensor design and implementation scheme are applicable to different topologies and router structures and scale to large networks.

(3) Efficient routing techniques for ONoCs that jointly optimize communication performance and energy efficiency with guaranteed thermal reliability. By analyzing the thermal effect in ONoCs, a network-level routing criterion is first presented. Combined with device-level wavelength tuning, it enables thermally reliable ONoCs. Two routing approaches, a MILP model and a heuristic algorithm called Contention-Aware Routing (CAR), are further proposed to minimize communication conflicts while guaranteeing thermal reliability, and to maximize communication energy efficiency in the presence of on-chip thermal variations. By applying the criterion, the proposed routing approaches achieve excellent performance with greatly reduced design-space exploration complexity. These techniques are applicable to 2D-Mesh and 2D-Torus topologies and scale to large ONoCs.
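
Contribution (1) describes heat transfer as an equivalent first-order RC circuit. The abstract does not give the model's equations or its two empirical scaling factors, so the following is only a minimal sketch of a generic single-node RC thermal response, not the thesis's model. The function name `rc_temperature`, its parameters, and all numeric values are hypothetical, chosen for illustration.

```python
import math


def rc_temperature(power_w, r_th, c_th, t_s, t_amb=45.0, t0=None):
    """Temperature (deg C) of one thermal node after t_s seconds at constant power.

    Generic first-order RC analogy (assumed, not taken from the thesis):
    power <-> current, temperature <-> voltage,
    r_th in K/W <-> resistance, c_th in J/K <-> capacitance.
    """
    if t0 is None:
        t0 = t_amb  # start at ambient unless an initial temperature is given
    t_steady = t_amb + power_w * r_th  # steady-state temperature
    tau = r_th * c_th                  # thermal time constant in seconds
    return t_steady + (t0 - t_steady) * math.exp(-t_s / tau)


if __name__ == "__main__":
    # Hypothetical core: 10 W, 2 K/W, 0.5 J/K, 45 deg C ambient.
    for t in (0.0, 1.0, 5.0):
        print(f"t = {t:.1f} s -> {rc_temperature(10.0, 2.0, 0.5, t):.2f} deg C")
```

In this sketch the temperature rises from ambient toward `t_amb + power_w * r_th` with time constant `r_th * c_th`; a closed form of this kind is what makes such a prediction cheap to evaluate inside a task-mapping loop.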
Keywords/Search Tags: Optical Network-on-Chip based Many-core System, Dark Silicon, Thermal Reliability, On-chip Thermal Monitoring, Inter-processor Communication Optimization