Research On Disk Failure Prediction Based On Cost-sensitive Learning

Posted on: 2021-04-17  Degree: Master  Type: Thesis
Country: China  Candidate: K Shan
GTID: 2518306104487904  Subject: Computer system architecture
Abstract/Summary:
With the development of big data and cloud computing, more and more data and services are deployed in data centers. Data center storage systems are therefore becoming larger and more heterogeneous, which raises the probability of hardware failures, seriously reduces the reliability and availability of the storage system, and increases operating and maintenance costs. Hard disk failure prediction, an active fault-tolerance technology that can proactively improve the reliability and availability of storage systems, has attracted growing attention from industry and academia. However, disk failure prediction still faces many problems, such as the imbalance of the dataset, the trade-off between the failure detection rate and the false alarm rate, and the high dimensionality of the failure data. These problems limit prediction performance and prevent the gap between actual prediction results and the theoretical value from narrowing further. To address the large capacity, heterogeneity, imbalance, and high dimensionality of the S.M.A.R.T. datasets used for disk failure prediction, this thesis designs a cost-sensitive hard disk failure prediction method, CSLM (Cost-Sensitive Learning Method). For the high-dimensionality problem, the method includes a feature selection algorithm that combines effect-size statistics with a genetic algorithm; filtering out irrelevant features significantly improves the model's accuracy. For the imbalance between positive and negative samples, the data distribution is balanced by cost-sensitive learning based on sample weighting, and a trade-off is made between the failure detection rate and the false alarm rate to reduce the misclassification cost while preserving the detection rate. Building the cost-sensitive model on heterogeneous data sources yields a lower misclassification cost than building it on a single data source. To find a better classifier for the sample-weighted cost-sensitive algorithm, the thesis compares several commonly used machine learning algorithms and finds that ensemble algorithms based on decision trees perform best for hard disk failure prediction. The proposed method is tested on open-source datasets. The results show that the feature selection algorithm improves AUC (area under the ROC curve) by 2% to 42% compared with the commonly used and efficient rank-sum test; the cost-sensitive learning method based on sample weighting reduces the misclassification cost by 52% to 96% compared with Rank Model; and modeling with heterogeneous data gives a misclassification cost 16% to 70% lower than modeling each data source alone, with a false alarm rate that is also 16% to 70% lower, although the failure detection rate of single-source modeling is 3% to 29% higher than that of heterogeneous-data modeling.
Keywords/Search Tags:Disk failure prediction, Machine learning, Cost-sensitive learning, Feature Selection
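
The sample-weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's CSLM implementation: the cost values (10 for a missed failure, 1 for a false alarm), the single risk score per disk, the toy labels, and all function names are assumptions made purely for illustration. The failure detection rate and false alarm rate use their standard definitions, TP/(TP+FN) and FP/(FP+TN).

```python
# Illustrative sketch only. Costs, data, and names are assumptions,
# not values or code from the thesis.

COST_FN = 10.0  # assumed cost of missing a failing disk (false negative)
COST_FP = 1.0   # assumed cost of a false alarm (false positive)


def predict(scores, threshold):
    """Label a disk as failing (1) when its risk score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def evaluate(y_true, y_pred, cost_fn=COST_FN, cost_fp=COST_FP):
    """Return (failure detection rate, false alarm rate, misclassification cost)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far, cost_fn * fn + cost_fp * fp


def sample_weights(y_true, cost_fn=COST_FN, cost_fp=COST_FP):
    """Weight each sample by the cost of misclassifying it."""
    return [cost_fn if t == 1 else cost_fp for t in y_true]


def fit_threshold(scores, y_true, weights):
    """Pick the threshold that minimizes the weighted training error."""
    best_thr, best_err = None, float("inf")
    for thr in sorted(set(scores)):
        y_pred = predict(scores, thr)
        err = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr


# Toy, imbalanced data: 3 failed disks (1) among 10, one hypothetical score each.
scores = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.9]
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]

# Unweighted fit: every mistake counts the same.
thr_plain = fit_threshold(scores, y_true, [1.0] * len(y_true))
# Cost-sensitive fit: mistakes are weighted by their misclassification cost.
thr_weighted = fit_threshold(scores, y_true, sample_weights(y_true))

for name, thr in (("unweighted", thr_plain), ("cost-weighted", thr_weighted)):
    fdr, far, cost = evaluate(y_true, predict(scores, thr))
    print(f"{name}: threshold={thr} FDR={fdr:.2f} FAR={far:.2f} cost={cost}")
```

On this toy data the unweighted fit chooses threshold 0.6, which misses one failure and raises no false alarms (cost 10), while the cost-weighted fit chooses 0.35, which catches all three failures at the price of three false alarms (cost 3). This is the kind of trade-off between detection rate and false alarm rate that the abstract describes; a real model would apply the same weights when training a classifier on many S.M.A.R.T. attributes.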