Design And Implementation Of Fault Tolerance Technology For Distributed System

Posted on: 2016-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Zhang
GTID: 2428330482958386
Subject: Software engineering
Abstract/Summary:
With the rapid development of the mobile Internet and social media, the demand from governments and enterprises for massive data storage and management has grown dramatically. Centralized data management systems carry high risk because they are a single point of failure, and they lack the computing power needed to process massive data. Distributed data management systems deployed over clusters are therefore necessary for mission-critical applications: they greatly increase both data processing capacity and system availability. Distributed systems have recently attracted growing attention from industry and academia, and they are widely deployed in fields such as civil aviation, finance, and industrial control. Although a distributed system can achieve higher reliability than a centralized one, it brings its own problems, such as partial failure, inconsistent clocks across machines, and unreliable message transmission. A distributed system is error-prone, and its failures may cause even greater economic losses. More sophisticated fault-tolerance technologies are therefore required to improve system reliability and availability.

In-memory computing systems deployed over clusters of shared-nothing machines are now advocated by many companies for real-time data analysis. CLAIMS is one such system, designed to exploit fast in-memory computation for real-time data processing. Accessing in-memory data is about 200 times faster than accessing data on disk. Nevertheless, data stored in memory is volatile, so more effort is needed to make such a system reliable and available. Improving fault tolerance in a distributed in-memory computing system is therefore the first issue to address when building it. Existing fault-tolerance strategies are still relatively simple and fail to meet the growing demands of massively parallel processing systems. CLAIMS faces the same problem and needs a complete set of fault-tolerance strategies to achieve higher availability.

In general, fault-tolerance strategies include component backup, checkpoint setup, operation migration, and so on. They solve the basic problems of fault-tolerant distributed systems to a certain degree, but this is far from enough. For complex, long-running queries, it is necessary to adopt a hybrid of fault-tolerance strategies, each optimized with more advanced techniques. Building on CLAIMS (a distributed in-memory database system), we designed and implemented its fault-tolerance module with hybrid strategies, including k-safe multi-projection in-memory storage, an adaptive dynamic heartbeat detection mechanism, and a selective dynamic checkpoint setup strategy, and we provide a programming framework integrated with QoS functions. Extensive experiments show that our fault-tolerance strategy performs better for OLAP processing tasks.

The contributions of this thesis are as follows:

1. Based on HDFS, we redesigned the underlying file storage system. Instead of the single-file, multiple-replica strategy, we designed a k-safe file storage strategy. Each original relational table is projected into a set of projections, with the constraint that each column of the original table has at least k replicas across all these projections. Each projection is further partitioned into a set of chunks, each 64 MB in capacity. This combination of vertical and horizontal partitioning is convenient for data loading. Furthermore, all data stored in the system is guaranteed to remain available even when part of the system fails while a query is being processed. This fault-tolerant storage system provides a solid foundation for queries.

2. We added an adaptive heartbeat detection mechanism to our fault-tolerant system, replacing the traditional fixed-frequency heartbeat messaging strategy. As a result, the average time to identify a machine failure is reduced. Specifically, the frequency of heartbeat exchange between machines is adjusted dynamically, depending on the workload of the running system. Because the heartbeat message load may be large in a big cluster, we proposed a multicast strategy for heartbeats that reduces the latency of machine-failure detection at a moderate cost in message communication.

3. We added QoS service functionality to CLAIMS for fault tolerance by dividing operations into different functionalities. The task processing in each stage can be quantified in this way, which gives an indication of system performance. At the same time, we use the OpenMPI programming library to achieve fault tolerance. Based on this programming model, the fault-tolerance architecture can be improved with timing messages as well as data exchange.

4. Last but not least, we proposed a method that sets checkpoints for each execution plan dynamically and calculates the execution time with a corresponding time cost model. Drawing on the idea of dynamic programming for global optimization, we first calculate the execution time of each stage of the query plan iteratively from the cost model, then compare the disk I/O overhead of a checkpoint with the cost of backtracking from the bottom leaf node of the execution plan to that stage, and finally set up checkpoints in the physical plan. As a result, the total cost of resuming part of the execution plan after a partial system failure is minimized.

By combining all the strategies listed above, we provide a complete fault-tolerance solution for the CLAIMS system. The extensive experimental studies presented in this thesis show that the proposed method significantly improves the efficiency of fault detection and recovery, and with it the reliability and availability of the system.
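
The k-safe storage constraint described in the first contribution can be illustrated with a small coverage check. The Python below is a minimal sketch, not the CLAIMS implementation: the function names `is_k_safe` and `chunk_count`, the sample table, and the representation of a projection as a list of column names are assumptions made for this example. Only the rule that every column has at least k replicas across the projections and the 64 MB chunk size come from the abstract, and reading 64 MB as 64 * 1024 * 1024 bytes is also an assumption.

```python
from collections import Counter
from math import ceil

# Chunk capacity stated in the abstract; the binary interpretation
# (64 * 2**20 bytes) is an assumption of this sketch.
CHUNK_BYTES = 64 * 1024 * 1024


def is_k_safe(table_columns, projections, k):
    """Return True if every table column appears in at least k projections.

    `projections` is a list of column-name lists (a hypothetical
    representation). A column repeated inside one projection counts once.
    """
    counts = Counter(col for proj in projections for col in set(proj))
    return all(counts[col] >= k for col in table_columns)


def chunk_count(projection_bytes):
    """Number of fixed-size chunks needed to hold one projection."""
    return ceil(projection_bytes / CHUNK_BYTES)


# Hypothetical three-column table split into three two-column projections.
columns = ["id", "name", "price"]
projections = [["id", "name"], ["id", "price"], ["name", "price"]]

two_safe = is_k_safe(columns, projections, 2)    # each column occurs twice
three_safe = is_k_safe(columns, projections, 3)  # no column occurs three times
chunks = chunk_count(200 * 1024 * 1024)          # a 200 MB projection
```

In this example every column appears in exactly two projections, so the layout satisfies the constraint for k = 2 but not for k = 3, and a 200 MB projection needs four 64 MB chunks.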
Keywords/Search Tags: Big data, Distributed cluster, In-memory computing, Fault tolerance and recovery