
Research On Method And Application For Failure Prediction Of Heterogeneous Disks In Large Data Center

Posted on: 2021-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
GTID: 1488306107457424
Subject: Computer system architecture
Abstract/Summary:
Disks are among the most commonly used devices for data storage, and disk failure prediction is an important means of ensuring data reliability. Disk failure prediction methods fall into two categories: device-level prediction (i.e., a complete disk failure) and sector-level prediction (i.e., a partial disk failure). At present, research mainly uses traditional machine learning methods such as support vector machines, logistic regression, decision trees, and random forests to predict disk failures, and has achieved good prediction results. However, these methods still have three disadvantages: (1) training a predictive model on the small number of disks of one model in the data center (minority disks) makes the model prone to overfitting and results in poor predictions; (2) the currently proposed methods are not generic prediction models: they are limited by the size of the disk dataset, the ratio of positive to negative samples, and the applicability and adaptability of the predictive model, resulting in poor prediction results; (3) current predictive models of latent sector errors focus only on binary classification, and using binary classification results to optimize the disk scrubbing strategy incurs unnecessary additional scrubbing cost, which is impractical. To address these disadvantages and meet the actual needs of large data centers, this dissertation studies failure prediction methods for heterogeneous disks and their applications.

For the problem of poor prediction results on minority disks, we propose transfer learning based failure prediction for minority disks of heterogeneous disk systems in large data centers (TLDFP). We call disk models with a relatively small number of disks minority disks. Because the training data for minority disks is insufficient, traditional machine learning (ML) algorithms trained on that data face a dramatically increased risk of overfitting or poor generalization, which will lead to the
poor performance of predictive models and seriously affect the reliability of the storage system. TLDFP uses the Kullback-Leibler divergence (KLD) to measure the distribution difference between two datasets and selects the majority disk model dataset with the smallest KLD value. It then uses the transfer learning algorithm TrAdaBoost to build a predictive model, adjusting the weights of the majority disk model samples during training to reduce the distribution difference between the majority disk model dataset and the minority disk model dataset, so that failures of minority disks can be predicted. When applied to three real-world datasets, TLDFP achieves on average a 96% FDR (failure detection rate) with a 0.5% FAR (false alarm rate), and it is the first to confirm the predictive performance of cross-disk-model failure prediction on three different types of disks (HDD, SSD) when addressing realistic system challenges.

For the problem that current prediction models are not generic in heterogeneous disk systems, we propose a method that enables high-dimensional disk state embedding for a generic failure detection system for heterogeneous disks in large data centers (HDDse). In addition to the minority disk failure prediction problem, the predictive methods proposed by existing studies are not generic prediction models for heterogeneous disks. Specifically, no single generic model simultaneously overcomes all the disadvantages of the existing methods. HDDse combines the advantages of distance-based anomaly detection and deep neural network prediction, and proposes an LSTM (Long Short-Term Memory) based siamese network. The LSTM structure learns the dynamically changing long-term behavior of disk health status, and the siamese network structure can map low-dimensional disk information to a high-dimensional space for feature learning and
generate a unified and efficient high-dimensional disk state embedding for failure prediction of heterogeneous disks. HDDse adapts well to disks that did not appear in the training datasets and delivers good performance on imbalanced or minority disk datasets. We evaluate HDDse on two real-world datasets and confirm that it is effective and outperforms several state-of-the-art approaches, thus improving storage system reliability.

For the current problems of sector-level failure prediction, we propose an adaptive and tiered disk scrubbing scheme with improved MTTD (mean time to detection) and reduced cost (Tier-Scrubbing, TS). Device-level disk failure prediction results often fail to meet the actual needs of current data centers, for two reasons. First, some sector-level disk failures, such as latent sector errors (LSEs), do not result in device-level disk failures, but they cause I/O read and write errors that may affect data reliability. Second, the current device-level disk failure prediction model has a 1% FAR, which leads to a large additional cost of replacing disks in large-scale data centers. Therefore, some researchers have begun to use artificial intelligence (AI) technology to predict LSEs and to use the predictions to optimize the disk scrubbing strategy. However, these studies have several limitations. First, they take a single snapshot of training data for prediction, without considering the sequential dependency between different statuses of a hard disk over time. Second, they apply only binary classification to the status of a sector and accelerate scrubbing only at a fixed rate, which can result in unnecessary scrubbing cost. Third, they simply increase the scrubbing rate uniformly for the full disk without giving special consideration to high-risk areas. We propose a novel scrubbing scheme called Tier-Scrubbing (TS), an adaptive and effective scrubbing scheme. It
introduces an adaptive scrubbing rate controller based on the LSTM model, which predicts not only which disks will develop LSEs but also the risk degree of those disks. This prediction is used to accelerate disk scrubbing at an adaptive rate. In addition, we design a module that exploits the locality of sector errors to locate high-risk areas on the disk and further improve scrubbing efficiency, and we propose a piggyback scrubbing strategy that takes advantage of I/O operation characteristics to improve the reliability of the storage system. Experiments show that this method reduces MTTD by about 80% while reducing scrubbing cost by 20% compared with the state-of-the-art method.
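
The source-selection step of TLDFP described in the abstract (choose the majority disk model whose data is closest to the minority disk data in KLD) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not taken from the dissertation: a single numeric attribute per disk, a shared equal-width histogram with add-one smoothing, the direction KL(minority || candidate), and the names `histogram`, `kl_divergence`, and `select_source`.

```python
import math


def histogram(values, bins, lo, hi):
    """Smoothed probability histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp the right edge into the last bin
        counts[idx] += 1
    # Add-one smoothing (an assumption) keeps every bin non-zero so the
    # logarithm in the KL divergence is always defined.
    total = len(values) + bins
    return [(c + 1) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def select_source(minority, candidates, bins=10):
    """Return the candidate name with the smallest KL(minority || candidate).

    `minority` is a list of attribute values from the minority disk model;
    `candidates` maps a (hypothetical) majority disk model name to its values.
    """
    all_values = list(minority) + [v for vals in candidates.values() for v in vals]
    lo, hi = min(all_values), max(all_values)
    if hi == lo:  # degenerate case: all values identical
        hi = lo + 1.0
    p = histogram(minority, bins, lo, hi)
    scores = {
        name: kl_divergence(p, histogram(vals, bins, lo, hi))
        for name, vals in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up attribute values, for illustration only.
    minority = [10, 11, 12, 11, 10, 12]
    candidates = {
        "model_A": [10, 11, 12, 12, 10, 11, 11],
        "model_B": [50, 55, 60, 52, 58, 61],
    }
    best, scores = select_source(minority, candidates)
    print(best, scores)
```

In practice disk health data has many attributes, so a real implementation would need a multivariate divergence estimate or a per-attribute aggregate; the abstract does not say which is used.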
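
The difference between fixed-rate acceleration driven by a binary prediction and the adaptive, risk-dependent rate that Tier-Scrubbing aims for can be shown with a small sketch. The risk levels, the speed-up factors in `SPEEDUP`, the 14-day base interval, and all function names are hypothetical placeholders chosen for illustration; the abstract does not give the controller's actual mapping.

```python
HOURS_PER_YEAR = 24 * 365

# Hypothetical speed-up factor per predicted risk level (0 = healthy).
# These values are illustrative assumptions, not taken from the source.
SPEEDUP = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}


def adaptive_interval_hours(base_interval_hours, risk_level):
    """Scrub interval that shrinks gradually as predicted risk grows."""
    if risk_level not in SPEEDUP:
        raise ValueError(f"unknown risk level: {risk_level}")
    return base_interval_hours / SPEEDUP[risk_level]


def binary_interval_hours(base_interval_hours, risk_level, fixed_speedup=8.0):
    """Baseline: any disk predicted as at-risk is scrubbed at one fixed rate."""
    if risk_level > 0:
        return base_interval_hours / fixed_speedup
    return base_interval_hours


def scrubs_per_year(interval_hours):
    """Number of full scrub passes per year for a given interval."""
    return HOURS_PER_YEAR / interval_hours


if __name__ == "__main__":
    base = 336.0  # hypothetical 14-day base scrub interval, in hours
    for level in sorted(SPEEDUP):
        adaptive = adaptive_interval_hours(base, level)
        binary = binary_interval_hours(base, level)
        print(level, scrubs_per_year(adaptive), scrubs_per_year(binary))
```

Under these assumed factors, a low-risk disk is scrubbed less often than the binary baseline would scrub it, which is where the cost saving of a graded scheme comes from; the highest-risk disks are scrubbed at the same rate in both schemes.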
Keywords/Search Tags:Data storage, Disk Failure, Sector Error, Disk scrubbing, Artificial Intelligence, Machine Learning