Research On Failure Node Repair Technology In Distributed Storage System

Posted on:2021-08-17

Degree:Master

Type:Thesis

Country:China

Candidate:W Guo

GTID:2518306554966039

Subject:Computer Science and Technology

Abstract/Summary:

The era of big data brings new challenges to traditional storage systems. Because classical storage systems handle large-scale data poorly, with limited scalability, weak data security, and overloading of individual nodes, distributed storage systems have emerged as the main solution for large-scale data storage. They offer excellent scalability, reliable data security, and high throughput for large-scale read and write operations. However, distributed storage systems are commonly built from cheap commodity hardware, which leads to a high data failure rate, so a redundancy policy must be applied to keep data safe. Compared with the traditional multi-copy (replication) policy, erasure codes can greatly reduce the overall storage cost for the same redundancy requirement. However, erasure-code redundancy generates a considerable amount of data traffic and leads to long recovery times during node repair.

Current research focuses on improving the encoding and decoding mechanisms of erasure codes and on finding better repair topologies, but no existing solution is free of shortcomings. First, many studies reduce either the repair traffic or the repair time. Optimizing the repair traffic alone can increase the repair time, while optimizing the repair time without considering data traffic may cause network congestion. Second, most studies focus on how bandwidth heterogeneity limits repair time and ignore the influence of heterogeneous node processing capacity. As encoding rules become more complicated and repair topologies grow, the effect of heterogeneous node processing capacity and coding complexity on repair time can no longer be ignored. Third, because the encoding mechanism is relatively complex, there are few studies on parallel multi-node repair, and little work extends the single-node repair mode to a multi-node repair mode. Finally, existing repair schemes generally need to construct an optimal repair tree to complete node repair, and multi-node repair methods generally assume that the edges of different repair trees do not intersect. Because the repair trees are constructed sequentially, the links with high available bandwidth are occupied by the trees constructed first, which increases both the repair time and the number of repair trees.

To address these problems, this thesis studies the erasure-code-based node repair process in distributed storage systems. The main work and contributions are summarized as follows:

(1) A new repair topology is proposed for single-node failure repair with minimum storage regenerating (MSR) codes. It jointly considers the heterogeneity of the available bandwidth in the cluster and of the nodes' processing capacity, and it optimizes the node recovery delay with regard to both repair time and data traffic. A constrained Steiner tree model is established for this topology, and a hybrid genetic algorithm is designed to obtain an approximate global optimum that trades off the two optimization objectives. Experimental results show that the repair time of the topology constructed by this method is 70% to 90% shorter than that of the traditional tree repair topology and 35% to 45% shorter than that of the traditional star repair topology. Although the repair traffic is 10% to 20% higher than with the traditional star topology, the method needs only 45% to 60% of the repair time of the traditional tree topology.

(2) The single-node failure repair topology with MSR codes from (1) is extended to the multi-node failure repair scenario. A new multi-node failure repair topology is constructed that allows links with high available bandwidth in the cluster to be reused. The multi-node repair problem is abstracted as a constrained optimization problem with repair time and repair traffic as the objective functions, and a hybrid genetic algorithm is designed to obtain an approximate global optimum. Experimental results show that the repair time of the multi-node repair scheme based on this method is 60% to 80% shorter than that of the traditional star repair topology with regenerating codes. Compared with the edge-disjoint tree repair scheme, the repair time is greatly reduced and the repair traffic is reduced by up to 30% to 40%.
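
The storage-cost and repair-traffic trade-off described in the abstract can be made concrete with a small calculation. The sketch below is not taken from the thesis. It uses the standard textbook expressions for replication, for a conventional (n, k) MDS erasure code, and for the MSR point of regenerating codes with d helper nodes. The function names and the example parameters (3 copies, n = 6, k = 4, d = 5) are assumptions chosen only for illustration.

```python
from fractions import Fraction


def replication(copies: int) -> dict:
    """Multi-copy redundancy: every copy stores the full data."""
    return {
        "storage_overhead": Fraction(copies),           # stored bytes per data byte
        "repair_traffic_per_lost_unit": Fraction(1),    # copy the lost data once
        "tolerated_failures": copies - 1,
    }


def mds_conventional(n: int, k: int) -> dict:
    """(n, k) MDS erasure code with conventional repair.

    Each node stores 1/k of the data; rebuilding one lost block
    reads k surviving blocks of the same size.
    """
    return {
        "storage_overhead": Fraction(n, k),
        "repair_traffic_per_lost_unit": Fraction(k),
        "tolerated_failures": n - k,
    }


def msr(n: int, k: int, d: int) -> dict:
    """MSR point of regenerating codes with d helpers (k <= d <= n - 1).

    Storage per node is the same as for an MDS code; repair traffic per
    unit of lost data is d / (d - k + 1).
    """
    if not (k <= d <= n - 1):
        raise ValueError("MSR repair needs k <= d <= n - 1 helper nodes")
    return {
        "storage_overhead": Fraction(n, k),
        "repair_traffic_per_lost_unit": Fraction(d, d - k + 1),
        "tolerated_failures": n - k,
    }


# Illustrative parameters (assumptions): all three schemes tolerate 2 failures.
schemes = {
    "3-way replication": replication(3),
    "(6, 4) MDS, conventional repair": mds_conventional(6, 4),
    "(6, 4) MSR, d = 5": msr(6, 4, 5),
}
for name, cost in schemes.items():
    print(name, {key: str(value) for key, value in cost.items()})
```

With these example parameters, all three schemes tolerate two node failures. The erasure-coded schemes halve the storage overhead relative to replication, but they move more data per repaired block. That extra repair traffic is the cost the repair topologies studied here aim to control alongside repair time.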

Keywords/Search Tags:

Distributed storage system, Single-node repair, Multi-node repair, MSR code, Genetic algorithm

Related items

1	Research On Node Repair Technology Of Distribute Storage System Under The Background Of Multiple Data Centers
2	The Research On Multi-node Repair Problem Of Distributed Storage System
3	Research On Repairing Algorithm Of Node In The Distributed Storage System
4	Research On Storage Node Selection And Node Repair Technology In Distributed Storage System
5	Research On The Technology Of Node Repair And Data Update In Distributed Storage System
6	The Research On Node Repair Problem Of Distributed Storage System
7	Research On Multi Stripe Repair Of Erasure Code In Distributed Storage System
8	Research On The Repair Pipelining Technology Of Erasure Codes In Distributed Storage
9	Design Of Piggybacking Framework In Distributed Storage System
10	Research On Multi-Strip Repair Of Load Balanced Erasure Code In Heterogeneous Distributed Storage System