
The Design And Implementation Of A Fault-tolerant Cluster Monitoring System

Posted on: 2013-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zhou  Full Text: PDF
GTID: 2248330374986015  Subject: Communication and Information System
Abstract/Summary:
In recent years, cluster technology has become an important research direction in high-performance computing. With their high performance-to-price ratio, high scalability, and high availability, cluster systems have rapidly developed into an important solution for high-performance computing, and they are widely used in the petroleum geophysical exploration industry.

This thesis studies the application of cluster systems to seismic data processing. As the number of nodes in a cluster and the number of submitted jobs keep growing, monitoring and managing the cluster has become a major problem. As the scale of the cluster expands, the probability of system failure also grows. Seismic data jobs usually involve large volumes of data and run for a very long time, so a failure that causes a job to abort wastes a large amount of computing resources and time. Studying fault-tolerance technology for such application clusters is therefore of great significance.

The cluster monitoring and fault-tolerance system studied and designed in this thesis is an important support system for the seismic data processing cluster, developed as a dedicated auxiliary system tailored to the characteristics of seismic data processing. To address the problems above, the main work of this thesis includes:

Firstly, existing cluster monitoring systems are studied, and a monitoring module is designed and implemented for the seismic data processing platform cluster. The module collects, aggregates, and displays monitoring information for the entire cluster. It mainly covers information about the nodes and jobs in the cluster, which makes it convenient for system administrators and users to manage and monitor the cluster.

Secondly, a node fault-tolerance function is designed and implemented. It detects node faults with heartbeat technology and carries out the follow-up handling of node failures for the seismic data processing application. Node fault detection and handling provide the basis for the job fault-tolerance function, which uses application-level checkpointing.

Thirdly, building on existing checkpoint technology and taking into account the particular characteristics of seismic data jobs and seismic data processing, an application-level job checkpointing and rollback recovery function based on the seismic data unit is designed and implemented. Combined with the node fault-tolerance function, it provides automatic fault tolerance when a job fails. Experimental tests verify the feasibility of application-level job checkpointing and show that it improves the availability of the cluster system. When a job fails, it can continue executing from the checkpoint, which reduces repeated execution time and avoids large waste of computing resources and time.
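The idea of application-level checkpointing and rollback recovery per data unit can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`load_checkpoint`, `save_checkpoint`, `run_job`), the JSON checkpoint file, and the treatment of a "seismic data unit" as one element of a list are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unit to process, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["next_unit"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_unit):
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_unit": next_unit}, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file in as a single step.
    os.replace(tmp, path)


def run_job(units, process_unit, checkpoint_path):
    """Process units in order, resuming after the last completed unit.

    Returns the index the run started from (0 for a fresh job).
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(units)):
        process_unit(units[i])                   # hypothetical per-unit work
        save_checkpoint(checkpoint_path, i + 1)  # record progress after each unit
    return start
```

If the job is restarted after a failure, calling `run_job` again with the same checkpoint path skips the units that already completed. The sketch assumes the checkpoint file lives on storage that the restarted job can reach, and that reprocessing the single unit in flight at the time of failure is harmless.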
Keywords/Search Tags:cluster, monitor, fault-tolerant, job checkpoint, seismic data