Improving Efficiency to Advance Resilient Computing

Posted on: 2019-01-19
Degree: Ph.D
Type: Thesis
University: University of Southern California
Candidate: Li, Ji
Full Text: PDF
GTID: 2478390017987713
Subject: Electrical engineering
Abstract/Summary:
Resilience is a major roadblock for high-performance computing (HPC) executions on future exascale systems, because the much higher error rates expected there lead to systems that either fail frequently and make little computational progress or return erroneous results [CGG+14, SWA+14]. Meanwhile, hardware failure mechanisms are affecting the resilience of commercial electronic systems at ground level [MBS10]. It is therefore imperative to develop resilient computing techniques for both high-end computing systems and commercial electronic systems, so that applications keep running to correct solutions despite the underlying hardware failures.

Among all hardware failure mechanisms, radiation-induced soft errors have become one of the most challenging [KMH12, WDT+14]. They can lead to silent data corruption and system failures, with potentially disastrous results in mission-critical systems such as mainstream servers, automobiles and spacecraft [Nic10]. Hence, the first part of the thesis is dedicated to a classical resilient computing problem: what is the Soft Error Rate (SER) of a circuit?

In the process, Deep Neural Networks (DNNs) and Deep Convolutional Neural Networks (DCNNs) have emerged as high-performance resilient systems that completely tolerate radiation-induced soft errors. More importantly, DNNs and DCNNs have achieved breakthroughs in many application fields that require detection and recognition, such as image classification, pattern recognition, and natural language processing [LBH15].
Nevertheless, these high-performance resilient systems face two challenges: (i) how to extend their success from detection and recognition tasks to complicated control problems with broader impact, and (ii) how to promote their adoption, since they are usually implemented on high-performance server clusters, on the widespread IoT and wearable devices with limited computation capacity.

Accordingly, the second part of this thesis is dedicated to solving these two challenges. A Deep Reinforcement Learning (DRL)-based framework is proposed that combines resilient DNNs with reinforcement learning to solve one complicated control problem, namely cloud computing resource allocation, which previous algorithms cannot resolve efficiently. Then, a Stochastic Computing (SC)-based DCNN architecture is proposed that maps the latest DCNNs to application-specific hardware in order to achieve orders-of-magnitude improvements in performance, energy efficiency and compactness. Unlike traditional binary computing systems, the SC-based DCNN architecture is resilient to radiation-induced soft errors; its main sources of error are the inaccuracy of SC components and of the hardware-based network design. Hence, in this part, improving the accuracy of state-of-the-art DCNNs is treated as the main objective, together with power, area and energy efficiency.

In conclusion, this thesis is dedicated to improving the efficiency of resilient computing through two parallel approaches. The classical approach is a fast and comprehensive SER evaluation framework for conventional computing circuits. The novel approach extends the emerging resilient DNNs to complicated control problems with broader impact and improves the efficiency of resilient DCNNs for widespread deployment in IoT and wearable devices.
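
The stochastic computing idea mentioned in the abstract can be illustrated with a small software sketch. It is not the thesis's architecture: it assumes the common unipolar encoding, in which a value in [0, 1] is represented by the fraction of 1s in a random bit-stream and multiplication reduces to a bitwise AND of two independent streams. All function names, the stream length and the example values are hypothetical choices made for illustration. The sketch shows the two properties the abstract refers to: the result is only approximate (the inaccuracy of SC components), and flipping a few bits changes the decoded value only slightly (resilience to soft errors).

```python
import random


def to_stream(p, n, rng):
    """Encode a probability p in [0, 1] as an n-bit unipolar stochastic stream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def from_stream(bits):
    """Decode a unipolar stream: the value is the fraction of 1s."""
    return sum(bits) / len(bits)


def sc_multiply(a, b):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [x & y for x, y in zip(a, b)]


def flip_bits(bits, k, rng):
    """Return a copy of the stream with k distinct randomly chosen bits flipped,
    as a crude stand-in for soft errors (an assumption of this sketch)."""
    out = list(bits)
    for i in rng.sample(range(len(out)), k):
        out[i] ^= 1
    return out


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 4096                 # stream length; longer streams give lower error

a = to_stream(0.5, n, rng)
b = to_stream(0.4, n, rng)

product_stream = sc_multiply(a, b)
product = from_stream(product_stream)                           # approximates 0.5 * 0.4
corrupted = from_stream(flip_bits(product_stream, 4, rng))      # after 4 bit flips

print("exact:", 0.5 * 0.4)
print("stochastic estimate:", product)
print("estimate after 4 bit flips:", corrupted)
```

Each flipped bit shifts the decoded value by at most 1/n, whereas a single flipped high-order bit in a conventional binary word can change the value by a large amount. The price is the random estimation error, which is why the abstract treats accuracy as the main objective for the SC-based design.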
Keywords/Search Tags:Resilient, Efficiency, Systems, Improving, Computing, DCNN, Complicated control, Radiation-induced soft errors