Research And Implementation Of A Cluster-oriented Fault Management System

Posted on: 2015-09-30 | Degree: Master | Type: Thesis
Country: China | Candidate: L Cheng | Full Text: PDF
GTID: 2308330479979482 | Subject: Computer technology
Abstract/Summary:
As science and technology advance, the demand for high performance computing keeps expanding. Cluster computing systems have quickly become the mainstream of high performance computing thanks to their comparatively high performance-to-price ratio and excellent scalability. However, with the constant pursuit of higher peak performance, system scale continues to grow, which raises the probability of system errors and faults beyond what current management and maintenance methods can handle efficiently. In this context, it is especially important to study efficient cluster-oriented fault management approaches that improve system availability and reliability and provide a stable computing environment for end users.

Based on a full analysis of the structure and characteristics of cluster systems and of the flaws of current management systems, this paper proposes a new hierarchical structure for the fault management system and studies in detail the key technologies it involves. The main work and innovations of the paper are as follows:

(1) Based on the characteristics of cluster systems, this paper proposes a self-similar, hierarchical fault management system structure, which can adapt to the fault management requirements of various cluster systems and achieves good scalability.

(2) To meet the communication needs of the different system modules, this paper designs a unified coding scheme for errors and fault-related information, which can describe the semantics of detected fault information in a fine-grained way and supports highly efficient automatic decoding.

(3) This paper studies and implements a new rapid discovery technology that achieves real-time fault detection at low overhead through stubs placed on the operating system's error-reporting branches. Fault detection for devices such as the CPU, memory, PCI-E bus, and disk, together with the fault reporting and summary mechanism, is also implemented, providing a primary basis for subsequent fault diagnosis, affected-domain analysis, and fault handling.
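
To make the idea of a unified, automatically decodable fault code in item (2) more concrete, the following Python sketch packs a fault description into a single 32-bit word and unpacks it again. The abstract does not give the actual encoding, so everything here is an assumption made for illustration: the field layout (16-bit node ID, 4-bit device class, 4-bit severity, 8-bit detail code), the names `FaultCode`, `Device`, and `Severity`, and the severity levels are all hypothetical and are not taken from the thesis. Only the device classes (CPU, memory, PCI-E bus, disk) come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Device(IntEnum):
    """Device classes named in the abstract; the numeric values are assumed."""
    CPU = 1
    MEMORY = 2
    PCIE = 3
    DISK = 4


class Severity(IntEnum):
    """Hypothetical severity levels, not taken from the thesis."""
    CORRECTED = 0
    RECOVERABLE = 1
    FATAL = 2


@dataclass(frozen=True)
class FaultCode:
    """Assumed 32-bit layout: node_id[31:16] device[15:12] severity[11:8] detail[7:0]."""
    node_id: int
    device: Device
    severity: Severity
    detail: int

    def encode(self) -> int:
        # Reject values that would overflow their (assumed) bit fields.
        if not 0 <= self.node_id < (1 << 16):
            raise ValueError("node_id must fit in 16 bits")
        if not 0 <= self.detail < (1 << 8):
            raise ValueError("detail must fit in 8 bits")
        return (
            (self.node_id << 16)
            | (int(self.device) << 12)
            | (int(self.severity) << 8)
            | self.detail
        )

    @classmethod
    def decode(cls, word: int) -> "FaultCode":
        # Fixed-width fields make decoding a handful of shifts and masks.
        return cls(
            node_id=(word >> 16) & 0xFFFF,
            device=Device((word >> 12) & 0xF),
            severity=Severity((word >> 8) & 0xF),
            detail=word & 0xFF,
        )


if __name__ == "__main__":
    fault = FaultCode(node_id=42, device=Device.MEMORY,
                      severity=Severity.RECOVERABLE, detail=7)
    word = fault.encode()
    print(hex(word))
    print(FaultCode.decode(word))
```

A fixed-width layout like this is one plausible way to obtain cheap automatic decoding, since every module can recover the fields with the same shifts and masks. The scheme described in the thesis may well be richer, for example by carrying variable-length semantic descriptions.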
Keywords/Search Tags:cluster, fault management, system architecture, grammar of information description, fault detection