With the explosive growth of global data and the increasing scale of distributed storage systems, hard disk failures have become the norm, and they seriously threaten both data reliability and service availability. Compared with passive fault tolerance alone, a distributed storage system that also uses active fault tolerance can cope more readily and effectively with the problems caused by hard disk failures. Active fault tolerance for storage systems consists mainly of two parts: hard disk failure prediction and data recovery in advance. Extensive research has been conducted on both, but most hard disk failure prediction models are built for a single model of mechanical hard disk and cannot handle the disk heterogeneity found in distributed storage systems. In addition, there is a research gap in advance data recovery for distributed storage scenarios that require low latency and high reliability. Current active fault tolerance technology therefore cannot meet the needs of distributed storage systems. In this thesis, we investigate an active fault tolerance scheme for distributed storage systems, with the goal of improving data reliability and ensuring service availability. First, for hard disk failure prediction, we propose a Multi-type Disk Failure Prediction (MTDFP) method for distributed storage systems, which builds a dedicated failure prediction model with better prediction performance for each hard disk series in the system. MTDFP is validated on two public real-world enterprise datasets, and the experimental results show that the method achieves an average Failure Detection Rate (FDR) of 78%, which provides a better basis and guidance for the subsequent advance data recovery strategy in the active fault tolerance scheme. Second, for advance data recovery, we propose a Data Scheduling Optimizer (DSO) for distributed storage systems based on a spare storage resource pool and early-warning priority, which migrates the at-risk data on multiple pre-failure hard disks to the spare drives in advance, in order of early-warning priority. DSO has been implemented and evaluated on the Ceph storage system, and the experimental results show that the strategy not only greatly reduces the additional data migration and data recovery time of the cluster, but also significantly improves the performance of cluster read and write operations. Finally, based on MTDFP and DSO, we form a complete active fault tolerance scheme for the Ceph storage system, covering data acquisition, prediction, and scheduling. The reliability quantification results show that the scheme can improve data reliability by 1-3 dimensions in Ceph clusters deployed with different policies.