
Research On Fast Recovery In Large-scale Storage System

Posted on: 2020-02-23  Degree: Master  Type: Thesis
Country: China  Candidate: Z F Wang  Full Text: PDF
GTID: 2428330626964593  Subject: Computer Science and Technology
Abstract/Summary:
Distributed storage systems are growing rapidly in scale. If every device in a storage system has a constant probability of failing, then as the number of devices grows, failures become more frequent and the data durability and availability of a large-scale storage system decline. Fast data recovery can improve durability and availability in such systems. However, the system is serving its customers while it recovers, so accelerating recovery blindly interferes with foreground traffic, which degrades the performance of both recovery and foreground traffic and wastes precious bandwidth. A fast, low-interference data recovery approach is therefore needed.

To this end, this thesis explores how to accelerate data recovery in a large-scale storage system with minimal interference to foreground traffic. Based on observations from a production system, it identifies why existing approaches fail to produce good recovery plans and designs a timeslot-based centralized scheduling framework. To make such a centralized scheduler fast and to improve its scheduling quality, the thesis proposes a series of key techniques grounded in these observations. With these designs, the proposed protocol achieves fast recovery with low interference to foreground traffic.

The main contributions of this thesis are:

(1) By investigating I/O and failure traces from a real-world large-scale storage system, this thesis finds that, because of the scale of the system and its imbalanced, dynamic foreground traffic, no existing recovery protocol can generate a high-quality recovery strategy in a short time. Moreover, when a node fails, there are a massive number of chunks to recover and a large number of candidate helpers, so sophisticated scheduling algorithms fail to produce a result in a short time.

(2) Based on these observations, this thesis proposes Dayu, a timeslot-based recovery protocol that schedules only the subset of tasks expected to finish within one timeslot. This approach reduces the computation overhead and naturally copes with dynamic foreground traffic. In each timeslot, Dayu applies four key algorithms to achieve fast, high-quality scheduling.

(3) Dayu is implemented on top of Pangu and tested both on a real-world cluster and in a simulation environment. Evaluations on a 1,000-node real cluster confirm that Dayu outperforms existing recovery protocols, achieving high recovery speed and low interference. Evaluations on a 25,000-node simulation confirm that Dayu scales well.
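The timeslot idea in contribution (2) can be illustrated with a minimal sketch. This is not Dayu's implementation and does not reproduce its four algorithms, which the abstract does not describe. It assumes a simple greedy rule: in each timeslot, a task is scheduled only if some replica holder has enough spare capacity left in that slot, and everything else is deferred to the next slot. The names `Task`, `schedule_timeslot`, `run`, and the per-slot spare-capacity dictionaries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One chunk to recover (hypothetical structure, for illustration only)."""
    chunk_id: str
    size_mb: float
    sources: list  # nodes that hold a surviving replica


def schedule_timeslot(pending, free_mb):
    """Greedily pick the tasks that fit into one timeslot.

    pending: list of Task objects still waiting for recovery.
    free_mb: dict mapping node -> MB it can send in this slot after
             foreground traffic is accounted for (assumed to be measured).
    Returns (assignments, leftover): the (chunk_id, source) pairs chosen
    for this slot, and the tasks deferred to a later slot.
    """
    budget = dict(free_mb)  # work on a copy, leave the caller's dict intact
    assignments, leftover = [], []
    for task in pending:
        # Prefer the replica holder with the most spare capacity.
        best = max(task.sources, key=lambda n: budget.get(n, 0.0), default=None)
        if best is not None and budget.get(best, 0.0) >= task.size_mb:
            budget[best] -= task.size_mb
            assignments.append((task.chunk_id, best))
        else:
            leftover.append(task)  # does not fit now, retry next slot
    return assignments, leftover


def run(pending, free_mb_per_slot):
    """Schedule slot by slot, re-reading spare capacity for every slot."""
    plan = []
    for free_mb in free_mb_per_slot:
        if not pending:
            break
        assigned, pending = schedule_timeslot(pending, free_mb)
        plan.append(assigned)
    return plan, pending


tasks = [
    Task("c1", 64, ["a", "b"]),
    Task("c2", 64, ["a"]),
    Task("c3", 64, ["b"]),
]
# Spare capacity changes between slots, standing in for dynamic foreground load.
plan, unfinished = run(tasks, [{"a": 64, "b": 100}, {"a": 0, "b": 64}])
```

Because each slot only considers work that fits in that slot and re-reads the spare capacity at the start of the next one, the per-slot computation stays small and the plan adapts as foreground load shifts, which is the property the abstract attributes to timeslot-based scheduling.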
Keywords/Search Tags: large-scale storage system, data recovery, scheduling, fast and low-interference