An Automated Fault-tolerance System With Breakpoint Recovering In Cloud Computing Platform

Posted on: 2018-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: H X Xu
GTID: 2348330521950951
Subject: Computer system architecture
Abstract/Summary:
Cloud computing plays an increasingly important role in scientific research, in reducing IT infrastructure investment for small and medium-sized enterprises, and in optimizing resource utilization. As the demand for computing power grows, so does the number of nodes in a cloud computing platform. Advances in hardware have greatly improved the reliability of individual nodes, but although the probability that any single node fails is small, the overall failure rate of a platform with many nodes is large. As a result, the availability and reliability of the platform are severely reduced, and tasks are terminated by node failures. In addition, frequent node failures waste computing and storage resources.

The main method for dealing with the problems caused by frequent node failure is fault tolerance. However, current fault-tolerance approaches have shortcomings such as excessive time and memory overhead, high labor costs, and low detection accuracy. To address this, this paper designs and implements an automated fault-tolerance system with breakpoint recovery for cloud computing platforms, according to actual demand. The main research work covers the following four aspects:

1. This paper analyzes the impact of frequent node failure on cloud computing platforms and summarizes the functional modules that a fault-tolerance system requires. On this basis, it designs and implements an efficient and practical fault-tolerance system covering three aspects: cloud computing platform architecture, passive fault tolerance, and proactive fault tolerance. The system improves the availability and reliability of the platform and addresses single points of failure, data loss, task failure, and other issues caused by node failure. With a simple deployment of the fault-tolerance system, node failures are detected automatically, and the time and memory overhead is lower than that of other approaches.

2. A scheme for saving and recovering task breakpoints in the cloud computing platform is proposed. Using a network file system, breakpoints are saved reliably, and after a node failure a task is recovered automatically from its breakpoint.

3. This paper proposes a proactive fault-tolerance method that predicts a node's status through load evaluation. When the load on a node is too high, the method selects some virtual machines for live migration. In this way the impact of node failure can be avoided and the overall failure rate of the platform reduced.

4. Based on OpenStack, a cloud computing platform with a highly reliable overall architecture has been implemented. This paper evaluates the system's passive and proactive fault tolerance on this platform. The experimental results show that the fault-tolerance system detects node failures and recovers from them automatically, and that it reduces the waste of computing resources caused by node failure.
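
The breakpoint saving and recovering scheme described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It only shows the general pattern of writing task state to shared storage and resuming from it. The function names (`save_checkpoint`, `load_checkpoint`, `run_task`, `demo`), the JSON state layout, and the `fail_at` hook that simulates a node failure are all hypothetical. A temporary directory stands in for the network file system mount.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the task state to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic on a local POSIX filesystem; on a real network
    # file system the exact guarantees depend on the server and mount options.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run_task(path, steps, fail_at=None):
    """Sum 0..steps-1, saving a checkpoint after every completed step.

    fail_at is a hypothetical hook that simulates a node failure.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state["total"]


def demo():
    # The temporary directory is an assumption standing in for an NFS mount
    # that every compute node can reach.
    with tempfile.TemporaryDirectory() as shared_dir:
        path = os.path.join(shared_dir, "task.ckpt")
        try:
            run_task(path, 10, fail_at=6)  # first attempt dies at step 6
        except RuntimeError:
            pass
        resumed_from = load_checkpoint(path)["next_step"]
        total = run_task(path, 10)  # second attempt resumes, does not restart
        return resumed_from, total
```

Calling `demo()` runs the task until the simulated failure, then reruns it. The second run starts at the saved step instead of step 0, so work finished before the failure is not repeated. Checkpointing after every step keeps the sketch simple. A real system would trade checkpoint frequency against I/O overhead.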
Keywords/Search Tags:cloud computing, automated fault-tolerance, breakpoint recovering, platform monitoring