
Research On Hard Disk Failure Prediction Method Based On Improved Random Forest Algorithm

Posted on: 2020-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: T L Zhang
Full Text: PDF
GTID: 2428330575951694
Subject: Software engineering
Abstract/Summary:
In recent years, emerging technologies such as the Internet of Things, cloud computing, cloud storage and big data have developed rapidly, and the global data volume has grown exponentially. Nearly 90% of the world's data is stored on hard disks in data centers. Because of the structure and storage mechanism of a hard disk, data stored on a disk that fails may be permanently lost, causing serious losses to enterprises and individuals. Redundant backup can prevent data loss when a disk fails, but it increases the cost of data storage, so predicting hard disk failure in advance has become the main trend.

The development of S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) and of operation and maintenance technology has laid the foundation for hard disk failure prediction. Almost all hard disks now support S.M.A.R.T., which provides the data needed for failure prediction, and data center operation and maintenance is shifting from automated operation to artificial-intelligence-based operation built on machine learning. Using machine learning for hard disk failure prediction can therefore improve prediction accuracy and help ensure the safety and reliability of data storage.

In this thesis, we analyze the characteristics of hard disk S.M.A.R.T. data from a real data center and use an improved random forest algorithm to build a failure prediction model and predict hard disk failures. The main research work includes:

(1) Hard disk S.M.A.R.T. data in a real data center is high-dimensional, so a dimensionality reduction method based on the correlation coefficient is proposed. The correlation coefficient between each pair of S.M.A.R.T. attributes is calculated, and one attribute is selected to stand in for the other attributes that are highly correlated with it.
(2) S.M.A.R.T. data in a real data center is imbalanced, and the random forest algorithm handles imbalanced data poorly, so an improved SMOTE algorithm is proposed to balance the data before modeling.
(3) Shortcomings of the traditional random forest model are addressed by adding a pruning step for the decision trees, selecting which decision trees to use, and assigning a weight to each decision tree.
(4) In a real data center, hard disk S.M.A.R.T. data is generated continuously over time, so an incremental learning strategy is proposed. With this strategy, the model can be updated with new data, so that the failure prediction model keeps learning over time.
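As an illustration of the correlation-based reduction described in item (1), the following is a minimal sketch rather than the thesis's actual procedure. It assumes the Pearson correlation coefficient and a greedy keep-first rule; the abstract does not state which coefficient, threshold, or selection order is used. The function names, the attribute names, the `threshold` value of 0.9, and the toy readings are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Constant column: treated as uncorrelated (an assumption of this sketch).
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_attributes(columns, threshold=0.9):
    """Greedily keep one representative per group of highly correlated attributes.

    columns:   dict mapping attribute name -> list of readings.
    threshold: hypothetical cut-off on |r|; not taken from the thesis.
    """
    kept = []
    for name, values in columns.items():
        # Keep an attribute only if it is not highly correlated
        # with any attribute that has already been kept.
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up toy readings, for illustration only.
smart = {
    "smart_5_raw": [0, 1, 2, 3, 4],
    "smart_197_raw": [0, 2, 4, 6, 8],  # exact multiple of smart_5_raw
    "smart_9_raw": [10, 3, 8, 1, 7],
}
print(select_attributes(smart))
```

In this toy data the second attribute is perfectly correlated with the first and is dropped, while the third is only weakly correlated and is kept. Which attribute of a correlated group should be the representative is a design choice; this sketch simply keeps the first one encountered.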
Keywords/Search Tags: Hard disk failure prediction, machine learning, S.M.A.R.T. technology, random forest, incremental learning