
Failure Tolerance And Prediction For Storage Systems In Datacenters

Posted on: 2020-10-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y W Xie
GTID: 1368330629982968
Subject: Computer Science and Technology
Abstract/Summary:
With data volumes growing, more and more storage devices are deployed in datacenters, and the failure rates of storage devices are rising. To improve the reliability of datacenter storage systems, replication and erasure coding are widely deployed to tolerate failures, and storage device failure prediction is deployed so that failures can be handled before they occur. Three problems remain. First, the two redundancy schemes are often deployed together according to the popularity of the data, but because data layouts are diverse, the transition between them easily causes heavy cross-rack traffic and risky co-located blocks, which harms both the reliability of the storage system and the performance of online data services. Second, datacenters contain many disk series that resemble one another in some respects and differ in others, which makes it hard to build high-quality failure prediction models for every series. Third, complex models are used in disk failure prediction because their strong learning capacity improves accuracy, but they have low explainability, so bias and overfitting are easily hidden in the model, and performance in deployment ends up much worse than in testing.

For distributed storage systems in datacenters, this dissertation proposes a new encoding scheme, a new modeling method for disk failure prediction, and a new explanation method, supporting high storage reliability both theoretically and technically.

To improve the transition from replication to erasure coding in distributed storage systems, a new encoding scheme named NSSE (Non-Sequential Striping Encoder) is proposed to preserve data reliability and improve encoding performance. When encoding replicated data, NSSE constructs non-sequential stripes according to the data layout, encodes the stripes using local computation, and guarantees that no blocks are co-located after encoding. Moreover, NSSE matches the number of replicas to the access popularity of the data, amortizing the encoding overhead and improving storage efficiency while preserving load balance. In the evaluation, compared with current encoding schemes, NSSE reduces cross-rack traffic by more than 50%, reduces encoding time by more than 30%, and reduces the time overhead that the encoder adds to concurrent I/O-intensive applications by about 60%.

To address the challenge of building failure prediction models for the many disk series in datacenters, a new modeling method named OME (Optimized Modeling Engine) is proposed to improve the models for all disk series. OME builds one-for-one models, transferred models, and one-for-all models according to the number of failed disks available, then tunes and compares these models by validation to select the best model for each disk series. Moreover, OME calculates the similarities between disk series to select the best source for transfer learning, and employs a simple instance-based transfer learning method. In the evaluation on a public dataset collected from real-world datacenters, OME outperforms a current modeling method for heterogeneous disks, improving the overall F1-score by 18.5% to 0.7715. In detail, OME improves precision by 22.3% and recall by 14.5%, reaches 97.18% accuracy, and reduces the false alarm rate by 18.5%.

To improve the explainability of complex models for disk failure prediction, a new explanation method named DFPE (Disk Failure Prediction Explainer) is proposed to explain both the models and the failure predictions they make. Current explanation methods explain models and outputs only by providing feature importances, which is not enough for disk failure prediction. By contrast, to explain a model, DFPE analyzes previous failure cases, infers the prediction rules, and calculates their detection rates and false alarm rates; this helps detect and handle the bias and overfitting hidden in the model and so improves its generalization. To explain an individual failure prediction, DFPE identifies the prediction rules that fired and shows their detection rates and false alarm rates, so that an operator can decide whether to trust the prediction and can infer the failure behavior; this improves the credibility of the models and enables intelligent failure handling. In the evaluation, a case study on a public dataset from a real-world datacenter shows that the visible explanations produced by DFPE are more detailed and accurate, which helps locate and handle hidden bias and improves the credibility of the failure predictions.
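
The abstract reports precision, recall, F1-score, and false alarm rate, and says OME selects the best model for each disk series by validation. The sketch below shows how those metrics follow from confusion counts and how a per-series selection step might look. It is not the dissertation's implementation: the names `Confusion` and `select_best`, the use of F1 as the selection criterion, and all counts are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for one model on one validation set."""
    tp: int  # failed disks correctly flagged
    fp: int  # healthy disks wrongly flagged (false alarms)
    fn: int  # failed disks the model missed
    tn: int  # healthy disks correctly passed


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    # Also called the failure detection rate.
    failed = c.tp + c.fn
    return c.tp / failed if failed else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_alarm_rate(c: Confusion) -> float:
    healthy = c.fp + c.tn
    return c.fp / healthy if healthy else 0.0


def select_best(candidates: Dict[str, Confusion]) -> str:
    """Pick the candidate with the highest validation F1 (assumed criterion)."""
    return max(candidates, key=lambda name: f1(candidates[name]))


# Hypothetical validation results for a single disk series.
validation = {
    "one_for_one": Confusion(tp=6, fp=4, fn=4, tn=986),
    "transferred": Confusion(tp=8, fp=2, fn=2, tn=988),
    "one_for_all": Confusion(tp=7, fp=7, fn=3, tn=983),
}

best = select_best(validation)
print(best, round(f1(validation[best]), 4), round(false_alarm_rate(validation[best]), 5))
```

Because failed disks are rare compared with healthy ones, accuracy alone says little about a model; F1 and false alarm rate are the more informative figures, which is why the sketch ranks candidates by F1.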
Keywords/Search Tags:Storage Reliability, Data Redundancy, Failure Tolerance, Disk Failure Prediction, Explainability