| Supercomputers play a crucial role in modern scientific research and practice. China’s supercomputers rank among the world’s most advanced and have been successfully applied in many fields. As computing power grows, the size and complexity of supercomputers also increase, which poses great challenges to system reliability. Accurate fault prediction can help avoid potential risks and improve the stability and maintainability of supercomputers. The fault logs generated during supercomputer operation contain valuable fault information and provide a reliable data source for building fault prediction models. In this paper, we propose a fault prediction model that combines HDBSCAN clustering with a CNN-BiLSTM-Attention network, based on the fault log data of the Tianhe II supercomputer at the Luliang Supercomputing Center in Shanxi Province. The model first groups the fault log features using the HDBSCAN clustering algorithm and then characterizes the fault logs to obtain the spatio-temporal distribution of faults in the system, which provides the basis for the subsequent fault prediction. Finally, the CNN-BiLSTM-Attention model extracts features from the fault log data, combining temporal and feature information, to predict both the fault time and the faulty nodes. The experimental results show that, compared with traditional machine learning methods, the proposed method offers easy feature extraction, high sensitivity to time-series features, and adequate extraction of local features. The method achieves high accuracy for fault time prediction, and its accuracy for fault node location is not less than 92.1%, which effectively improves the accuracy, reliability, and stability of supercomputer fault prediction. |