Font Size: a A A

Fault-tolerant And Energy-aware Algorithms For Workflows And Real-time Systems

Posted on:2021-04-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:L HanFull Text:PDF
GTID:1368330629480890Subject:Software engineering
Abstract/Summary:PDF Full Text Request
The massive calculation requirements from scientific simulations have become the most direct driving force for the development of high-performance computers(HPC).Although the improvement of computing power drives new breakthroughs in numerous scientific areas,it also brings new challenges.This thesis is focused on the two major problems in the high per-formance computing context:resilience and energy consumption.To satisfy the computing power required by modern scientific research,the number of computing units in supercomput-ers increases dramatically in the past years.This leads to more frequent errors than expected.Obviously,failure handling is critical for highly parallel applications that use a large num-ber of components for a significant amount of time.Otherwise,one may spend infinite time re-executing.At the other side,power management is necessary due to both monetary and envi-ronmental constraints.Especially because resilience often calls for redundancy in time and/or in space,which in turn consumes extra energy.In addition,technologies that reduce energy consumption often have negative effects on performance and resilience.As a consequence,we must take into consideration the reliability and performance degradation when managing power.In this context,we re-design scheduling algorithms to investigate trade-offs between performance,resilience and energy consumption,thus solving challenging multi-criteria opti-mization problems in large-scale computing.To be specific,1.This thesis studies the scheduling and checkpointing techniques(redundancy in time)to deploy scientific workflows on large parallel platforms,to solve the problem of resilience and minimizing the total execution time(makespan).The proposed solution consists of two phases:an allocation of tasks on processors;a decision on which task to checkpoint.This thesis identifies the importance of avoiding crossover dependencies to prohibit pro-cessor interference,then designs optimal solutions for special classes of task graphs,and provides general-purpose heuristics for arbitrary ones.1)For a particular class of task graphs,namely M-SPGs,at the mapping phase,we take advantage of its recursive struc-ture and apply proportional mapping to schedule it as sets of superchains.By checkpoint-ing all exit tasks in each superchain,one could get the minimal set of tasks to checkpoint in order to avoid crossover dependencies.At the next stage,we propose a novel idea of "task checkpoints" which enables a dynamic programming to optimally decide where to put checkpoints inside each superchain.2)For an arbitrary task graph,at the map-ping phase,we resort to classical scheduling heuristics(i.e.,HEFT and MINMIN)and provide their variations that reduce the crossover dependencies.At the next stage,we combine them with several checkpointing strategies to provide both fine and loose so-lutions for different situations.To the best of our knowledge,this is the first general solution that copes with unlimited number of failures for arbitrary workflow.To assess the performance and effectiveness of proposed algorithms,this thesis designs a discrete-event simulator and compares our methods to:an approach in which all application data is checkpointed(CKPTALL),which corresponds to the standard way in which most pro-duction workflows are executed today;and an approach in which no application data is checkpointed(CKPTNONE).Exhaustive simulations demonstrate that our algorithm out-performs both the former approach,because of lower checkpointing overhead,and the latter approach,because of better resilience to failures.In theory,this thesis establishes the#P-completeness of computing the expected makespan of CKPTNONE strategy.2.This thesis explores the scheduling and replication strategies(redundancy in space)of periodically independent task sets,to solve the problem of energy minimization with deadlines and reliability constrains.A solution consists of three steps:determine the con-figuration of replicas for each task;map and static schedule each replica onto a proces-sor;dynamically update the schedule while taking into account actual execution times.All stages aim at minimizing the overlaps between replicas,thus minimizing the total energy consumption.This study is done first for homogeneous systems and then for heterogeneous systems.1)For homogeneous platforms,this thesis considers a more re-alistic formula for expected energy consumption at the replica setting phase;proposes a layered WFD mapping heuristic in order to spread primary replicas over processors and lead to a more balanced load;designs several scheduling heuristics to minimize overlaps,which is achieved by re-ordering chunks between deadlines,pre-fetching available pri-mary replicas and taking advantages of idle times.2)For heterogeneous platforms,as the heterogeneity makes the problem more complex,one could not know the number of replicas needed before actually assigning a task to a processor.This thesis provides sev-eral criteria to guide the design of heuristics at each phase.This thesis also develops a discrete-event simulator.Extensive simulations are conducted for various range of param-eters and execution scenarios,to evaluate the effectiveness of proposed approach in terms of achieving reliability goal and minimizing energy consumption.In theory,this thesis calculates a theoretical lower bound to quantitatively evaluate the absolute performance.Results showed that our best heuristic always achieves an excellent performance that is very close to this bound.Moreover,several complexity results for the sole scheduling phase are provided.
Keywords/Search Tags:high performance computing, scheduling algorithms, real-time systems, resilience, energy consumption
PDF Full Text Request
Related items