Fault-tolerant And Energy-aware Algorithms For Workflows And Real-time Systems

Posted on:2021-04-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Han

Full Text:PDF

GTID:1368330629480890

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

The massive calculation requirements from scientific simulations have become the most direct driving force for the development of high-performance computers(HPC).Although the improvement of computing power drives new breakthroughs in numerous scientific areas,it also brings new challenges.This thesis is focused on the two major problems in the high per-formance computing context:resilience and energy consumption.To satisfy the computing power required by modern scientific research,the number of computing units in supercomput-ers increases dramatically in the past years.This leads to more frequent errors than expected.Obviously,failure handling is critical for highly parallel applications that use a large num-ber of components for a significant amount of time.Otherwise,one may spend infinite time re-executing.At the other side,power management is necessary due to both monetary and envi-ronmental constraints.Especially because resilience often calls for redundancy in time and/or in space,which in turn consumes extra energy.In addition,technologies that reduce energy consumption often have negative effects on performance and resilience.As a consequence,we must take into consideration the reliability and performance degradation when managing power.In this context,we re-design scheduling algorithms to investigate trade-offs between performance,resilience and energy consumption,thus solving challenging multi-criteria opti-mization problems in large-scale computing.To be specific,1.This thesis studies the scheduling and checkpointing techniques(redundancy in time)to deploy scientific workflows on large parallel platforms,to solve the problem of resilience and minimizing the total execution time(makespan).The proposed solution consists of two phases:an allocation of tasks on processors;a decision on which task to checkpoint.This thesis identifies the importance of avoiding crossover dependencies to prohibit pro-cessor interference,then designs optimal solutions for special classes of task graphs,and provides general-purpose heuristics for arbitrary ones.1)For a particular class of task graphs,namely M-SPGs,at the mapping phase,we take advantage of its recursive struc-ture and apply proportional mapping to schedule it as sets of superchains.By checkpoint-ing all exit tasks in each superchain,one could get the minimal set of tasks to checkpoint in order to avoid crossover dependencies.At the next stage,we propose a novel idea of "task checkpoints" which enables a dynamic programming to optimally decide where to put checkpoints inside each superchain.2)For an arbitrary task graph,at the map-ping phase,we resort to classical scheduling heuristics(i.e.,HEFT and MINMIN)and provide their variations that reduce the crossover dependencies.At the next stage,we combine them with several checkpointing strategies to provide both fine and loose so-lutions for different situations.To the best of our knowledge,this is the first general solution that copes with unlimited number of failures for arbitrary workflow.To assess the performance and effectiveness of proposed algorithms,this thesis designs a discrete-event simulator and compares our methods to:an approach in which all application data is checkpointed(CKPTALL),which corresponds to the standard way in which most pro-duction workflows are executed today;and an approach in which no application data is checkpointed(CKPTNONE).Exhaustive simulations demonstrate that our algorithm out-performs both the former approach,because of lower checkpointing overhead,and the latter approach,because of better resilience to failures.In theory,this thesis establishes the#P-completeness of computing the expected makespan of CKPTNONE strategy.2.This thesis explores the scheduling and replication strategies(redundancy in space)of periodically independent task sets,to solve the problem of energy minimization with deadlines and reliability constrains.A solution consists of three steps:determine the con-figuration of replicas for each task;map and static schedule each replica onto a proces-sor;dynamically update the schedule while taking into account actual execution times.All stages aim at minimizing the overlaps between replicas,thus minimizing the total energy consumption.This study is done first for homogeneous systems and then for heterogeneous systems.1)For homogeneous platforms,this thesis considers a more re-alistic formula for expected energy consumption at the replica setting phase;proposes a layered WFD mapping heuristic in order to spread primary replicas over processors and lead to a more balanced load;designs several scheduling heuristics to minimize overlaps,which is achieved by re-ordering chunks between deadlines,pre-fetching available pri-mary replicas and taking advantages of idle times.2)For heterogeneous platforms,as the heterogeneity makes the problem more complex,one could not know the number of replicas needed before actually assigning a task to a processor.This thesis provides sev-eral criteria to guide the design of heuristics at each phase.This thesis also develops a discrete-event simulator.Extensive simulations are conducted for various range of param-eters and execution scenarios,to evaluate the effectiveness of proposed approach in terms of achieving reliability goal and minimizing energy consumption.In theory,this thesis calculates a theoretical lower bound to quantitatively evaluate the absolute performance.Results showed that our best heuristic always achieves an excellent performance that is very close to this bound.Moreover,several complexity results for the sole scheduling phase are provided.

Keywords/Search Tags:

high performance computing, scheduling algorithms, real-time systems, resilience, energy consumption

PDF Full Text Request

Related items

1	Energy-aware Task Scheduling Algorithms And Application For High-performance Computing
2	Safety Critical Embedded Systems Of Energy Consumption Measurement Analysis And Real-time Scheduling
3	Research On Real-Time Task Scheduling Approaches For Energy Consumption And Temperature Optimization
4	Research On Transactions Scheduling For Saving Energy Consumption In Mobile Real-time Database Systems
5	Research Of Minimizing Energy Consumption Scheduling Algorithm For Parallel Real-Time Tasks In Multicore Systems
6	Scheduling for CPU Packing and node shutdown to reduce the energy consumption of high performance computing centers
7	Research On Energy-Efficient Scheduling Algorithm Of Distributed Real-Time Systems
8	Design And Implementation Of Energy Consumption Data Real-time Acquisition System In High Efficiency Data Center
9	Research On Resource Scheduling Algorithms In Cloud Computing
10	Energy-efficient scheduling algorithms for real-time systems