| High-performance computing systems meet the growing demand for computing power and high-performance storage created by big data and cloud computing, and have therefore become an effective technical foundation for both. As big data and cloud computing techniques come into wide use, data security and system reliability become far more important. As high-performance computing systems are deployed more widely, they grow in size and complexity, and the probability of failure grows exponentially as a result. Establishing an effective automatic fault detection mechanism has thus become a major topic in high-performance computing research. This thesis focuses on how to build an effective and intelligent online fault detection system for high-performance computing clusters. To address this problem, we carry out the following work. First, compared with normal data, anomalous data is very rare in a high-performance computing system, so fault detection can be treated as a binary classification problem in pattern recognition. In addition, unsupervised learning methods do not use historical data; they rely only on data collected while the system is running, and therefore remain applicable as high-performance computing systems expand. Based on these two points, we propose unsupervised pattern recognition methods for fault detection in high-performance computing systems, which extends the application area of pattern recognition. A fault detection mechanism based on unsupervised learning is also scalable, because it can still be used as the system grows. Second, we study data collected at the operating system level. We use Linux system commands to gather all of the research data, such as memory, CPU, I/O, and network metrics. These commands remain effective for collecting data as the system grows. Third, we propose an automatic fault detection mechanism that combines PCA-based feature extraction with distance-based outlier detection to identify abnormal data in the system, and we verify the proposed mechanism experimentally. We consider two cases: a single error and a variety of errors. The results show that the PCA-based fault detection algorithm is effective for a single error, that is, the detection accuracy is high and the false alarm rate is low. However, the method performs poorly when multiple errors coexist in the system: its accuracy is very low, while its false alarm rate and missed alarm rate are very high. Fourth, because non-Gaussian data cannot be separated by PCA, we use ICA instead, which in theory makes the components of the data mutually independent. Experiments verify that the ICA-based method achieves both high detection accuracy and a low false detection rate, and that ICA outperforms PCA for fault detection in high-performance computing systems. Finally, we propose the PCA-ICA algorithm, which combines the PCA and ICA methods. We first apply PCA to linearly decorrelate the data and then use ICA to achieve optimal separation. Experimental results show that the PCA-ICA method has higher accuracy than ICA alone. We conclude by summarizing the content of this thesis and pointing out directions for further research. |
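
The first mechanism described above (PCA-based feature extraction followed by distance-based outlier detection) can be illustrated with a minimal sketch. It is not the thesis implementation: the data is synthetic, the function names `pca_project` and `distance_outliers` are hypothetical, and the specific distance rule (distance from the median point, compared against a median-plus-scaled-MAD cutoff) is an assumption, since the abstract does not specify how the distance threshold is chosen. Each row stands for one node and each column for one OS-level metric (memory, CPU, I/O, network).

```python
import numpy as np


def pca_project(X, n_components):
    """Standardize X (nodes x metrics) and project onto the top principal components."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant metrics
    Z = (X - mu) / sigma
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T


def distance_outliers(Y, threshold=3.0):
    """Flag rows of Y that lie unusually far from the median point.

    Assumed rule (not taken from the source): a row is an outlier when its
    Euclidean distance exceeds median + threshold * 1.4826 * MAD of all distances.
    """
    center = np.median(Y, axis=0)
    d = np.linalg.norm(Y - center, axis=1)
    med = np.median(d)
    mad = np.median(np.abs(d - med))
    mad = mad if mad > 0 else 1e-12
    return d > med + threshold * 1.4826 * mad


# Synthetic example: 64 nodes, 4 metrics (memory, CPU, I/O, network).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=5.0, size=(64, 4))
faulty_node = 17
X[faulty_node] += 100.0  # injected single fault, purely illustrative

features = pca_project(X, n_components=2)
flags = distance_outliers(features)
print("flagged nodes:", np.flatnonzero(flags))
```

Because the detector only compares nodes with one another at a given moment, it needs no labelled history, which matches the unsupervised setting the abstract argues for. Replacing or following the PCA step with an ICA transform would correspond to the ICA and PCA-ICA variants, which this sketch does not cover.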