Cluster Reliability Approach Based On GPU Energy Consumption Analysis

Posted on: 2019-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y L Fang
GTID: 1368330611488652
Subject: Systems analysis and integration
Abstract/Summary:
Driven by the demand for large-scale computing in fields such as the Internet of Things, artificial intelligence, and the life sciences, and by advances in semiconductor process technology, GPU clusters bring higher performance to large-scale computing through their high processing power. However, the high integration of devices and the high density and refresh rate of transistors cause computing systems to consume excessive energy, which in turn makes the reliability of these systems an increasingly serious problem. Effective energy analysis and reliability improvement methods are therefore urgently needed in large-scale computing systems. To address these problems, the research content and innovations of this dissertation are as follows:

1. The use of computing and storage resources by different programs on different types of GPUs is analyzed, and the types of the major kernels in different programs are defined accordingly. A two-level performance optimization model is then proposed to optimize the performance of parallel programs at the thread level and the instruction level.

2. A fault-tolerance model based on a power consumption calculation system is proposed. To calculate the energy consumption of each processor accurately, a cluster energy consumption calculation system based on a wireless sensor network is designed. The energy consumption of all devices is obtained from real-time current measurements and the related mathematical models. On this basis, an asynchronous checkpoint scheme that monitors multiple indicators in real time is proposed to control the checkpoint interval. The interval is adjusted dynamically according to changes in current and power consumption to reduce the time overhead of fault tolerance. In addition, to further optimize fault tolerance, checkpoint locations are adjusted within an optional range to reduce the amount of redundant data that must be saved.

3. A Dynamic Task Flow Migration Approach (DTMA) is proposed for low-dependency programs. With DTMA, dynamic task migration can effectively reduce the energy consumption caused by node load, improve the energy efficiency of equipment, avoid transient faults, and improve the reliability of the computing system.

4. A resource scheduling method based on the natural attribute priority of the dataflow is proposed for high-dependency programs. It uses heuristics and a genetic algorithm to find an approximately optimal task-processor allocation scheme that fully accounts for GPU type and task type. Then, to address the problem that high power consumption causes nodes to fail in large-scale computing, a dynamic resource adjustment scheme is proposed to reduce device load. Although reducing the task load affects the performance of certain tasks, it achieves globally optimal reliability by sacrificing locally optimal performance.

Finally, the performance optimization approach of this dissertation is verified using the convolution calculation in deep learning: average performance improves by 20.57% for convolution calculations of different scales. On this basis, the reliability model is applied to road vehicle detection. The real-time power consumption of each node is obtained during long-term running of the YOLO v3 model, and task scheduling and migration are performed according to changes in power consumption to ensure that the computing system can operate safely and reliably for a long time. In the current experimental environment, system failures are reduced to zero, which effectively improves reliability.
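
The idea in point 2, obtaining energy from sampled current and adapting the checkpoint interval to power changes, can be illustrated with a minimal Python sketch. The abstract does not give the dissertation's actual models, so everything here is an assumption made for illustration: the function names, the constant supply voltage, the uniform sampling period, and the inverse-proportional scaling rule with clamping are all hypothetical and are not the author's method.

```python
def energy_joules(currents_a, voltage_v, dt_s):
    """Approximate energy as the sum of V * I * dt over current samples.

    Assumes a constant supply voltage and uniformly spaced samples
    (both are illustrative assumptions, not taken from the dissertation).
    """
    return sum(voltage_v * i * dt_s for i in currents_a)


def next_checkpoint_interval(base_interval_s, power_w, baseline_power_w,
                             min_interval_s=30.0, max_interval_s=3600.0):
    """Return a checkpoint interval that shrinks as power rises above baseline.

    Hypothetical rule: interval = base * baseline / power, clamped to
    [min_interval_s, max_interval_s]. Higher power draw is treated as a
    sign of higher fault risk, so checkpoints are taken more often.
    """
    if power_w <= 0 or baseline_power_w <= 0:
        raise ValueError("power values must be positive")
    interval = base_interval_s * baseline_power_w / power_w
    return max(min_interval_s, min(max_interval_s, interval))


if __name__ == "__main__":
    # Three current samples (amperes) taken every 0.5 s on a 12 V rail.
    print(energy_joules([1.0, 2.0, 3.0], voltage_v=12.0, dt_s=0.5))
    # Node drawing 300 W against a 200 W baseline, 600 s base interval.
    print(next_checkpoint_interval(600.0, 300.0, 200.0))
```

In this sketch a node drawing more power than its baseline is checkpointed more often, and a lightly loaded node less often, within fixed bounds. A real scheme would also weigh the cost of writing each checkpoint, which this sketch ignores.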
Keywords/Search Tags:GPU Cluster, Reliability, Fault Tolerance, Energy Consumption Optimization, Task Scheduling, Dataflow Natural Attribute