
Research On Self-Adaptive Fault-Tolerant Techniques For Many-Core Processors

Posted on: 2017-08-25  Degree: Doctor  Type: Dissertation
Country: China  Candidate: W T Jia  Full Text: PDF
GTID: 1318330536467111  Subject: Computer Science and Technology
Abstract/Summary:
As semiconductor feature sizes continue to shrink, processors suffer from increasingly serious errors, including soft errors, hard errors, and process variability. Errors that were once rare enough to ignore in non-critical applications now recur. Many-core processors deliver higher performance and lower power density than multi-core processors, and are therefore increasingly widespread. Many-core architectures also differ substantially: each core is simple and generally forgoes out-of-order execution and branch prediction; cores communicate over on-chip networks rather than buses; hardware cache coherence is hard to guarantee, so software-managed caches or local stores are used instead; and the operating system typically runs only on a control core rather than on every core. Fault-tolerance techniques for many-core processors therefore differ markedly from those for multi-core processors, and low-overhead fault-tolerance research targeted at many-core processors is necessary.

Many-core processors suit compute-intensive applications rather than control-intensive ones, and are rarely used in critical domains such as aerospace and energy. This dissertation therefore focuses on fault tolerance for general-purpose applications (GPA). GPA need low-overhead fault-tolerance mechanisms and cannot bear the cost of traditional methods such as triple modular redundancy (TMR) or even dual modular redundancy (DMR). Moreover, a many-core processor running GPA faces many variable factors. A processor integrating hundreds of cores typically runs multiple applications simultaneously, each with different fault-tolerance requirements. The efficiency of a many-core processor is usually low and changes as applications run. The chip error rate, affected by temperature, voltage, frequency, and other operating conditions, varies as the operating environment changes. Focusing on the fault tolerance of many-core processors running GPA, this dissertation develops
several adaptive techniques along different dimensions, such as fault-tolerance requirement, computational efficiency, and error rate. These techniques dynamically adjust the fault-tolerance mechanism to reduce overhead. The main research work and contributions of this dissertation are as follows:

· To reduce the hardware overhead of heavyweight redundancy methods, we propose a lightweight redundancy method based on dynamically coupled redundancy pairs, which lowers error-detection and recovery overhead through hardware/software co-design. A redundancy pair is two processor cores that execute the same program and check each other's results for errors. Conventional redundancy techniques modify the original architecture or add substantial hardware in order to preserve performance. Our method instead builds redundancy pairs from the cores' own resources, leaving the core architecture unmodified and adding only a small amount of hardware. Results are compared in hardware, which adds little logic yet greatly reduces comparison latency and improves recovery coverage, and the co-design also reduces the overhead of checkpoint updates. Overall, redundancy overhead is reduced by 22%.

· To overcome the small error coverage of anomaly-detection techniques, we propose a method that detects anomalous behaviour across processor cores, and further improve coverage by mixing in DMR. Redundancy-free anomaly detection catches errors by observing abnormal behaviour, such as instruction overflow or illegal memory accesses, at minimal overhead. The key question in anomaly detection is how to distinguish normal from abnormal behaviour. A rare event is usually treated as an anomaly; in this dissertation, we instead treat behavioural differences among processor cores as the anomaly, which subsumes the usual notion as a special case. The method detects abnormal behaviour by comparing behavioural metrics, such as the number of executed instructions and the number of memory accesses, across several processor cores. It
can greatly increase error coverage and reduce misjudgements, cutting the error-detection failure rate from 57% to 10%. Because programs differ in sensitivity, system error rates vary from program to program. To further reduce the failure rate of anomaly-based detection, this dissertation applies DMR to programs with higher error rates; at 16% overhead, the failure rate falls further from 10% to 4.5%.

· Since full DMR halves throughput, we propose a partial redundancy method driven by each application's fault-tolerance requirement, and schedule redundant cores to further reduce overhead. Most methods duplicate every application in the system for fault tolerance; this dissertation duplicates only the applications that require it, reducing the proportion of redundant work. Redundancy techniques run the same program on two cores: we call the core responsible for input and output the main core, and the other the redundant core. Usually the number of redundant cores equals the number of main cores; we instead let fewer redundant cores cover more main cores. In particular, when an application's fault-tolerance requirement is low, an inefficient core can serve as its redundant core, greatly reducing the cost of fault tolerance.

· To reduce the checkpoint overhead of fixed-interval schemes under a variable error rate, we propose a self-adaptive checkpoint (SACP) method that dynamically matches checkpoint intervals to the error rate. Checkpointing is the most popular recovery method, and the checkpoint interval strongly influences performance, yet most interval-selection schemes assume a constant soft error rate (SER). SACP instead analyses the occurrence of errors more carefully and matches the checkpoint interval to the real-time SER dynamically. Since the benefit of SACP depends on how the SER varies, we must evaluate the impact of SER variability on
self-adaptive checkpointing. We study the impact of theoretically varying SER on checkpoint overhead, and propose a way to predict the SER from recently observed errors, demonstrating the practical benefit of self-adaptive checkpointing. Results show that when the SER varies with an amplitude above 3x and the variation is sustained, our method improves performance by more than 12.5%.

This dissertation focuses on the fault tolerance of many-core processors running GPA and develops several adaptive techniques along different dimensions, such as fault-tolerance requirement, computational efficiency, and error rate. These techniques significantly reduce the fault-tolerance cost of many-core processors and have strong practical significance.
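The cross-core anomaly-detection idea above can be illustrated with a minimal sketch. This is not the dissertation's implementation; the counter names (`insns`, `mem`), the median-based comparison, and the 5% deviation threshold are all illustrative assumptions. The sketch flags any core whose behavioural counters deviate from the median of cores running the same program.

```python
from statistics import median

def detect_anomalous_cores(counters, rel_tol=0.05):
    """Flag cores whose behaviour counters deviate from the median
    across cores executing the same program.

    counters: dict core_id -> dict of counter name -> value,
              e.g. {"insns": ..., "mem": ...}
    Returns the set of core ids judged anomalous.
    """
    anomalous = set()
    counter_names = next(iter(counters.values())).keys()
    for name in counter_names:
        values = [c[name] for c in counters.values()]
        med = median(values)
        for core_id, c in counters.items():
            # Relative deviation from the cross-core median beyond
            # the tolerance is treated as anomalous behaviour.
            if med and abs(c[name] - med) / med > rel_tol:
                anomalous.add(core_id)
    return anomalous

counters = {0: {"insns": 1000, "mem": 200},
            1: {"insns": 1002, "mem": 201},
            2: {"insns": 1500, "mem": 200},
            3: {"insns": 999,  "mem": 199}}
print(detect_anomalous_cores(counters))  # core 2's instruction count is an outlier
```

Because the comparison is against the majority behaviour rather than a fixed rule, it subsumes rule-based exceptions (e.g. address violations) as a special case, as the abstract notes.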
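The partial-redundancy idea, where fewer redundant cores cover more main cores, resembles a bin-packing problem. The sketch below is an assumption-laden illustration, not the dissertation's scheduler: it models each application's fault-tolerance requirement as a fraction of time its main core must be duplicated, and greedily packs main cores onto redundant cores so no checker exceeds full utilisation.

```python
def assign_redundant_cores(requirements):
    """requirements: dict main_core -> fraction of time it must be
    duplicated (1.0 = full DMR, 0.0 = no fault tolerance needed).
    First-fit-decreasing packing: each redundant core time-multiplexes
    checking for several main cores, with total load <= 1.0.
    Returns dict redundant_core_index -> list of main cores covered.
    """
    loads = []          # current load per redundant core
    assignment = {}
    for core, req in sorted(requirements.items(), key=lambda kv: -kv[1]):
        for i, load in enumerate(loads):
            if load + req <= 1.0 + 1e-9:
                loads[i] += req
                assignment[i].append(core)
                break
        else:
            # No existing redundant core has capacity: allocate one.
            loads.append(req)
            assignment[len(loads) - 1] = [core]
    return assignment

# Four main cores with partial requirements need only two redundant
# cores, instead of four under full DMR.
print(assign_redundant_cores({"A": 0.5, "B": 0.5, "C": 0.3, "D": 0.2}))
```

A low-requirement application tolerates a heavily shared (and hence inefficient) redundant core, which is exactly the overhead reduction the abstract describes.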
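The SACP idea of matching checkpoint intervals to a time-varying SER can be sketched with a standard building block. The sketch uses Young's first-order approximation of the optimal checkpoint interval, T = sqrt(2C/lambda); the dissertation's actual interval model and SER predictor may differ, and the sliding-window rate estimator here is an illustrative assumption.

```python
import math

def estimate_error_rate(error_timestamps, window):
    """Estimate the current error rate (errors/second) from errors
    observed within the most recent `window` seconds, taking the
    last observed error as 'now'."""
    if not error_timestamps:
        return 0.0
    now = error_timestamps[-1]
    recent = [t for t in error_timestamps if now - t <= window]
    return len(recent) / window

def checkpoint_interval(checkpoint_cost, error_rate):
    """Young's approximation: T ~ sqrt(2 * C / lambda), where C is
    the cost of taking one checkpoint and lambda the error rate.
    With no observed errors, checkpointing is deferred indefinitely."""
    if error_rate <= 0:
        return math.inf
    return math.sqrt(2 * checkpoint_cost / error_rate)

# As the estimated SER rises, the interval shrinks; as it falls,
# checkpoints are taken less often, recovering the lost throughput.
rate = estimate_error_rate([0.0, 50.0, 90.0, 100.0], window=100.0)
print(checkpoint_interval(2.0, rate))
```

Recomputing the interval from a window of recent errors is what lets the scheme track the 3x-amplitude SER swings the results refer to, rather than assuming a constant SER.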
Keywords/Search Tags:Fault-tolerance, Many Core Processors, Self-adaptive, Low overhead, Error rate variation, Checkpoint