Research On Imbalanced Data Processing Methods For Industrial Big Data

Posted on: 2019-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: S L Chen
Full Text: PDF
GTID: 2428330623950965
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet and intelligent computing technology, massive amounts of industrial data are collected, stored, analyzed, and used for decision support. Intelligent data analysis based on industrial big data is attracting increasing attention from industry and academia. Machine-learning-based fault detection is an important application of industrial big data: by detecting equipment faults in time, it helps reduce the losses caused by faults and improve the quality of industrial products. Fault detection for industrial equipment usually requires a very low error rate, because a single error can have serious consequences. However, practice and research show that equipment fault detection on industrial big data inevitably faces an imbalanced data problem, which results in low recall for machine learning algorithms. Based on the characteristics of industrial big data, this thesis studies machine learning algorithms for imbalanced data and real-time processing technology for industrial big data, and makes the following contributions.

First, to address the problems of existing imbalanced learning methods based on data sampling and ensemble learning, we propose Rotation SMOTE, an imbalanced data learning algorithm that fuses SMOTE, Bagging, and Boosting. During the training of the Boosting model, the method performs targeted synthetic sampling of minority-class samples, guided by the predictions of the base classifier, to improve recall. PCA is used to rotate the original samples so that an ensemble of multiple models can be built, which increases sample diversity. Experiments show that, compared with other imbalanced learning algorithms such as SMOTEBoost and EasyEnsemble, Rotation SMOTE significantly improves recall and achieves the best or second-best G-mean and F1 score on most datasets.

Second, cost-sensitive Boosting algorithms for imbalanced learning are limited in that all samples in one class share the same misclassification cost. Drawing on the basic idea of Focal Loss in deep learning, we propose FocalBoost, a Boosting method that distinguishes how easy or hard each sample is to classify. During Boosting training, the predicted probability of the weak model serves as the reference for the next update of the sample weights, so that each individual sample, rather than each class of samples, receives a different degree of attention. Experiments show that FocalBoost achieves better classification performance than the original AdaBoost algorithm on imbalanced datasets.

Third, to better support the intelligent analysis of industrial big data, we use open-source distributed software such as Kafka, Spark, and OpenTSDB to design and implement a real-time processing framework for industrial big data, and optimize it through configuration tuning, reduced computation and network overhead, and load balancing. Experiments show that the system can process more than one million data points per second.
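The abstract evaluates imbalanced learning methods by recall, G-mean, and F1 score rather than by accuracy. The following is a minimal Python sketch of how these metrics are commonly computed for a binary fault/normal problem. It assumes the usual binary definition of G-mean (the square root of recall times specificity), which the abstract does not spell out, and the function name `imbalance_metrics` and the toy labels are hypothetical, not taken from the thesis.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, F1 and G-mean for binary labels.

    Assumes the minority (fault) class is `positive` and that G-mean is
    sqrt(recall * specificity); both are illustrative assumptions.
    """
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # fault correctly detected
            else:
                fn += 1  # fault missed
        else:
            if pred == positive:
                fp += 1  # false alarm
            else:
                tn += 1  # normal sample correctly passed

    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "g_mean": g_mean}


# Toy data: 2 faults among 10 samples (hypothetical, for illustration only).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts "normal" looks accurate but finds no faults.
always_normal = imbalance_metrics(y_true, [0] * 10)

# A classifier that finds one of the two faults at the cost of one false alarm.
one_hit = imbalance_metrics(y_true, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```

On this toy data the always-normal classifier reaches 0.8 accuracy with zero recall, F1, and G-mean, which illustrates why accuracy alone can hide poor fault detection on imbalanced data.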
Keywords/Search Tags: Industrial Big Data, Imbalanced Data, SMOTE, Boosting, Ensemble Learning, Focal Loss