Research On Implementation Technologies Of Checkpoint System And Optimization Of Performance

Posted on: 2006-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Li
GTID: 2178360185996952
Subject: Computer system architecture
Abstract/Summary:
As clusters grow in scale, the probability that at least one node fails increases. A node failure can interrupt a task running on the cluster, which wastes large amounts of resources and may even prevent the task from ever completing. A checkpoint system provides fault tolerance for computing nodes and has become an important piece of cluster operating system software. Implementing a single-node checkpoint system is an important precondition for building a parallel checkpoint system and a fault-tolerant environment for the whole cluster.

This project aims to provide a robust and flexible single-node checkpoint system for the Dawning 4000 cluster. Dawning 4000 is based on the Linux OS and AMD Opteron CPUs, but no checkpoint system with open source code or related documentation that supports this platform has been found so far.

First, this thesis analyzes and compares existing checkpoint systems, including single-node checkpoint systems, parallel checkpoint systems, and checkpoint models integrated into commercial operating systems. It then studies in depth the technologies needed to implement checkpointing; in particular, it discusses the detailed flow design of BLCR, a classic system-level checkpoint system.

For the x86-64 architecture, we analyze its new features and how Linux supports them. On this basis, we implement a system-level single-node checkpoint system based on BLCR. Furthermore, the platform dependence of system-level checkpointing is discussed from two aspects: the OS and the CPU.

Finally, two optimization strategies are presented. The combined-write strategy resolves the performance drop problem, and the A-O storage strategy improves the time performance of the checkpoint system. Experimental results indicate that the A-O strategy can reduce checkpoint time overhead to 50% in the best case.
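The opening claim, that the combined failure probability rises with cluster scale, can be illustrated with a small calculation. The sketch below is not taken from the thesis: it assumes that nodes fail independently and with the same per-node probability over the period of interest, and the function name `cluster_failure_probability` and the sample values are hypothetical.

```python
def cluster_failure_probability(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails.

    Assumption (not from the thesis): node failures are independent and
    each node fails with the same probability p_node over the period.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be in [0, 1]")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability, for illustration only.
    p = 0.001
    for n in (1, 16, 128, 1024):
        print(f"{n:5d} nodes: {cluster_failure_probability(p, n):.4f}")
```

Under these assumptions the probability grows monotonically with the node count, which is the motivation for checkpointing long-running jobs on large clusters.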
Keywords/Search Tags: cluster, fault tolerance, checkpoint system, store strategy