Research On Soft Error Resilience Analysis And Fault-tolerance Strategy For GPGPU Architecture

Posted on: 2022-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H S Yue
GTID: 1488306758979159
Subject: Computer system architecture
Abstract/Summary:
Many industrial and academic organizations now use General-Purpose Graphics Processing Units (GPGPUs) for High-Performance Computing (HPC) because of their massively parallel computing capability and improved programmability. Unlike traditional GPUs, which are mainly used for graphics, GPGPUs run general-purpose HPC programs such as scientific computing, machine learning, and data mining, so ensuring their reliability under soft errors is particularly critical. To jointly optimize GPGPU reliability, performance, and energy consumption, we conduct a systematic study of the energy efficiency of fault-tolerance techniques in GPGPUs. The research covers soft-error resilience analysis, prediction models, and fault-tolerance strategies.

Error resilience analysis characterizes the tolerance of GPGPU programs to soft errors, supports the identification of their error-resilient regions, and provides knowledge for effective fault-tolerance design. To accelerate this analysis, the soft error prediction model mines heuristic error-resilience features of GPGPU programs and uses a machine learning model to reveal the hidden interactions between program error resilience and those features, thereby supporting fast and efficient resilience prediction. Guided by the conclusions of the analysis and prediction, the fault-tolerance strategies improve GPGPU reliability through techniques such as redundancy and Error Correction Code (ECC) mechanisms.

In summary, we make the following key contributions in this study:

1. Soft Error Resilience Analysis: Some GPGPU programs are inherently tolerant of some errors. Inspired by approximate computing, we first propose a GPGPU-based Soft-Error aware APproximation analysis framework, G-SEAP, to explore the approximation characteristics of faulty outcomes caused by soft errors. Previous fault injection (FI) frameworks treat all Silent Data Corruptions (SDCs) caused by soft errors as intolerable. In contrast, G-SEAP uses an application-specified quality metric to quantify the difference between SDCs and error-free results. If the difference meets the user-defined Target Output Quality (TOQ), we call such SDCs SDCs-acceptable, meaning that they do not noticeably affect execution correctness and the outputs can be used as approximate results. Using G-SEAP, we exhaustively analyze 17 representative HPC benchmarks and observe that 72.7% of SDCs on average are approximable. In addition, we find that the dataflow of the application, the reliability requirement of each kernel function, the instruction type, and the data bit position are all essential factors for program correctness.

2. Soft Error Prediction Model: We build a GPGPU-based Soft Error Prediction Model, G-SEPM, which can replace FI to estimate the resilience characteristics of individual fault sites accurately and efficiently. Our key insight is that instruction type, bit position, bit-flip direction, and error propagation information can describe fault site resiliency in GPGPUs. Using these heuristic features, G-SEPM drives a machine learning model to reveal the hidden interactions between fault site resiliency and the observed features. Experimental results demonstrate that G-SEPM achieves an average accuracy of 93.92% for fault site error estimation and covers 95.99% of critical fault sites with 95.39% precision. On average, G-SEPM obtains a speedup of 6557× over FI.

Based on G-SEAP and G-SEPM, we further design two energy-efficient soft error fault-tolerance strategies for GPGPUs:

1. To address the energy inefficiency of the ECC mechanism in the GPGPU register file, we leverage the error sensitivity of instructions, the duplication of values in same-named registers, and the error sensitivity of data bits to build a unified energy-Efficient ECC mechanism for the GPGPU register file (Eff-ECC). Eff-ECC consists of Instruction Aware ECC (IA-ECC), Duplication Aware ECC (DA-ECC), and Bit Aware ECC (BA-ECC). Considering the error sensitivity of instructions, IA-ECC implements ECC only for the write registers of critical instructions. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values. Leveraging the inherent error tolerance of the program, BA-ECC protects only the significant bits of registers against critical errors. Experimental results demonstrate that Eff-ECC reduces the energy consumption of traditional SEC-DED ECC by 86.46%. Moreover, the energy efficiency of Eff-ECC makes it a feasible ECC solution for future low-power embedded GPGPU systems.

2. We design an approximate instruction duplication technique to mitigate the impact of soft errors that occur during instruction execution. Based on the instruction-level analysis of G-SEAP, we observe that errors in some instruction types have a negligible impact on the program's execution correctness. We therefore improve the efficiency of instruction duplication by relaxing the protection of error-insignificant instructions. Experimental results show that our technique decreases the proportion of SDCs from 70.51% to 5.19% by selecting 33.70% of dynamic instructions for duplicate execution, which is 49.86% less than the precise instruction duplication technique and achieves a good tradeoff between reliability and performance.

In summary, this study first proposes an error resilience analysis framework and a prediction model for GPGPU programs, which can effectively measure the impact of soft errors in GPGPUs. Based on the resulting analysis and prediction conclusions, we further design an energy-efficient ECC mechanism and an approximate instruction duplication technique to improve the soft error reliability of data storage and instruction execution. This study aims to build more “error-efficient” GPGPU soft error fault-tolerance technologies and methods, avoid excessive disturbance of the computing system by soft errors, and help the computing system better balance performance, energy consumption, and reliability.
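
The SDC-acceptable criterion described for G-SEAP can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the relative L2 error used as the quality metric, the reading of the TOQ as a maximum tolerated error, and the names `Outcome`, `relative_l2_error`, `classify`, and `acceptable_sdc_share` are all assumptions made for illustration. The abstract only states that the metric is application-specified and the TOQ is user-defined.

```python
import math
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"                  # output identical to the error-free run
    SDC_ACCEPTABLE = "sdc-acceptable"  # output differs but meets the TOQ
    SDC_CRITICAL = "sdc-critical"      # output differs and violates the TOQ


def relative_l2_error(golden, faulty):
    """Hypothetical quality metric: ||faulty - golden|| / ||golden||."""
    if len(golden) != len(faulty):
        raise ValueError("outputs must have the same length")
    diff = math.sqrt(sum((f - g) * (f - g) for g, f in zip(golden, faulty)))
    norm = math.sqrt(sum(g * g for g in golden))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def classify(golden, faulty, toq):
    """Classify one fault-injection run against the error-free (golden) output.

    `toq` is assumed here to be the largest relative error the user accepts.
    """
    if list(golden) == list(faulty):
        return Outcome.MASKED
    error = relative_l2_error(golden, faulty)
    # A NaN error compares false with everything, so it falls through
    # to SDC_CRITICAL, as does an infinite error.
    if error <= toq:
        return Outcome.SDC_ACCEPTABLE
    return Outcome.SDC_CRITICAL


def acceptable_sdc_share(outcomes):
    """Fraction of SDC runs (masked runs excluded) that are acceptable."""
    sdcs = [o for o in outcomes if o is not Outcome.MASKED]
    if not sdcs:
        return 0.0
    return sum(o is Outcome.SDC_ACCEPTABLE for o in sdcs) / len(sdcs)


if __name__ == "__main__":
    golden = [1.0, 2.0, 3.0]
    runs = [
        [1.0, 2.0, 3.0],    # fault had no visible effect
        [1.0, 2.0, 3.001],  # small deviation in a low-order position
        [1.0, 2.0, 35.0],   # large deviation, e.g. a high-order bit flip
    ]
    outcomes = [classify(golden, run, toq=0.01) for run in runs]
    print([o.value for o in outcomes])
    print(acceptable_sdc_share(outcomes))
```

Under these assumptions, a campaign of injected faults reduces to a list of `Outcome` values, and the share of acceptable SDCs is the kind of aggregate the abstract reports per benchmark. A real framework would also have to handle crashes and hangs, which this sketch does not model.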
Keywords/Search Tags:GPGPU, Soft Error, Error Resilience, Error Prediction, Instruction Duplication, Error Correction Code(ECC)