Design And Implementation Of The Message Communication And Display Frameworks Of An Autonomous Fault Management System For Supercomputers

Posted on: 2019-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: T Xiao
Full Text: PDF
GTID: 2428330623950928
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous improvement of supercomputer performance, system scale is also growing rapidly. As a result, supercomputers fail more frequently and their reliability faces increasingly serious threats. Reliability limits further growth in system scale and thus further improvement in performance; this limit is called the "Reliability Wall" and is a severe challenge for current and future supercomputers. To address the "Reliability Wall", a comprehensive solution, an autonomous fault management system for supercomputers (AFMSS), is proposed and implemented. AFMSS automatically manages the entire life cycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. It can dramatically improve the efficiency and reduce the cost of fault handling, thus improving the reliability of large-scale supercomputer systems. AFMSS is a large and complex system with a set of important capabilities, and this dissertation focuses on the design and implementation of two of them: unified message communication and fault information display. Specifically, the following two aspects of work have been carried out:

(1) Design and Implementation of the Message Communication Framework

The Message Communication Framework provides unified message communication and consists of a hierarchical architecture and a publish/subscribe-based inter-module collaboration mechanism. The hierarchical architecture decomposes fault management into functional units and allocates them to multiple layers. Only the bottom layer is deployed on each node of the supercomputer, and it is responsible only for fault detection and for simple fault diagnosis and handling; complex functions are left to the upper layers. The upper layers are deployed on separate management servers and can therefore use far more resources to accomplish more sophisticated functions, such as more complex fault diagnosis and handling from the perspective of a larger set of nodes. The hierarchical architecture not only reduces the impact of fault management on the performance of the supercomputer's nodes, but also supports more effective management of faults from the perspective of node sets and of the whole supercomputer system. It also ensures that AFMSS scales well with system size.

In the publish/subscribe-based inter-module collaboration mechanism, all functional modules are classified into three types, namely publisher, subscriber, and subscriber/publisher modules, and corresponding implementation interfaces are provided. These modules communicate only with the event service module, by publishing fault events to it, receiving fault events from it, or both. In this way the entire process of fault management is driven by fault events flowing between modules and between layers. This mechanism unifies the communication modes of the various functional modules as well as the working modes of the subsystems at all layers, and it makes the logical structure of AFMSS clear. This reduces development difficulty and workload and ensures that AFMSS can be readily extended with new functions.

(2) Design and Implementation of the Message Display Framework

The Message Display Framework displays fault information so that system administrators and maintainers can more easily assess the health of the whole supercomputer system and locate faulty nodes for the necessary hardware repair or replacement. The framework adopts a client/server (C/S) architecture. The server, as a subscriber module in the top-layer subsystem, subscribes to the event service module for all fault events and periodically formats important fault information into an SCSDL document, which is sent to the client. After receiving the SCSDL document, the client unmarshals it and visualizes the contained fault information through the GUI. The SuperComputer System Description Language (SCSDL) is expressive enough to describe the actual physical layout of the computing nodes and the status of each node in a supercomputer of any scale. With the help of SCSDL, the server compresses the fault information and the client separates the content of the display interface from its view, which makes the Message Display Framework scalable and flexible.

A deployment experiment with a prototype system on the Tianhe-2 supercomputer verifies the feasibility and effectiveness of the work in this dissertation. The work is a positive exploration and a useful trial of fault management and status information display for large-scale petascale systems and future exascale systems.
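
The publish/subscribe-based collaboration described in part (1) can be illustrated with a minimal, in-process Python sketch. It is not the AFMSS implementation: the class names (`EventService`, `Detector`, `Diagnoser`, `DisplayFeed`), the topic strings, and the node name are all hypothetical, and a real deployment would pass events between layers over the network rather than through direct function calls. The sketch only shows how the three module types (publisher, subscriber/publisher, and subscriber) can cooperate purely through events exchanged with an event service.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    topic: str   # hypothetical topic name, e.g. "fault.detected"
    node: str    # hypothetical node identifier
    detail: str  # free-text description of the fault


Handler = Callable[[FaultEvent], None]


class EventService:
    """Hypothetical in-process stand-in for the event service module."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, event: FaultEvent) -> None:
        # Deliver the event to every handler registered for its topic.
        for handler in list(self._handlers.get(event.topic, [])):
            handler(event)


class Detector:
    """Publisher module: only emits fault events."""

    def __init__(self, service: EventService) -> None:
        self._service = service

    def report(self, node: str, detail: str) -> None:
        self._service.publish(FaultEvent("fault.detected", node, detail))


class Diagnoser:
    """Subscriber/publisher module: consumes one event type, emits another."""

    def __init__(self, service: EventService) -> None:
        self._service = service
        service.subscribe("fault.detected", self._on_detected)

    def _on_detected(self, event: FaultEvent) -> None:
        diagnosed = FaultEvent(
            "fault.diagnosed", event.node, f"diagnosed: {event.detail}"
        )
        self._service.publish(diagnosed)


class DisplayFeed:
    """Subscriber module: only collects events (e.g. for later display)."""

    def __init__(self, service: EventService) -> None:
        self.received: List[FaultEvent] = []
        service.subscribe("fault.diagnosed", self.received.append)


def run_demo() -> List[FaultEvent]:
    service = EventService()
    detector = Detector(service)
    Diagnoser(service)  # registers itself with the event service
    feed = DisplayFeed(service)
    detector.report("cn0042", "ECC memory error")
    return feed.received
```

In this sketch no module holds a reference to any other module; each one knows only the event service. That is the property the abstract attributes to the mechanism: a new functional module can be added by subscribing to or publishing the relevant events without changing existing modules.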
Keywords/Search Tags:Message Communication Framework, Message Display Framework, Autonomous Fault Management, Supercomputer