Research On Classification Algorithms For Imbalanced Data Stream With Concept Drift

Posted on: 2022-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: B Liang
Full Text: PDF
GTID: 2518306527977919
Subject: Computer technology
Abstract/Summary:
The explosive growth of information has made data streams common in many application fields, such as wireless sensor data streams, stock trading data streams, and e-commerce data streams. Efficiently extracting valuable information from these streams has become the main task of data stream mining research. Unlike static data, a data stream is typically high-speed, large-scale, and dynamically changing. A change in the target concept or the underlying distribution of a data stream produces concept drift, which seriously degrades data mining performance. In addition, data streams usually suffer from class imbalance, which makes it difficult to learn from the minority class instances. Handling concept drift and class imbalance together has therefore become a popular research direction in data stream mining. To address the significant loss of accuracy that this joint problem causes in data stream classification models, this thesis proposes three data stream classification algorithms and evaluates them through simulation experiments. The main contributions are as follows:

(1) Current data stream classification algorithms that handle concept drift have two main problems. First, they suffer from high drift detection delay and a high false alarm rate, and they struggle to handle different types of drift at the same time. Second, they cannot identify recurring concepts. To address this, the thesis proposes a data stream classification algorithm that is based on an active detection mechanism and is applicable to multiple types of concept drift. The algorithm uses a double-layer window to store the most recent classification results, assigns weights to the data in the window according to a membership function, and calculates the weighted error rate. It then applies the McDiarmid bound to detect concept drift by evaluating the significance of the change in error rate within the current window. After a drift is detected, a semi-parametric log-likelihood algorithm checks whether the current concept is a recurrence of a past concept, and the algorithm then decides whether to reuse the old classifier. Experimental results show that the proposed algorithm outperforms similar existing algorithms in terms of detection delay, false positive rate, classification accuracy, and running time.

(2) To solve the joint problem of concept drift and class imbalance in binary classification data streams, the thesis proposes a G-mean weighted online data stream classification algorithm. It modifies block-based ensemble algorithms by introducing an online update mechanism for the component classifiers and their weights, combined with resampling and an adaptive sliding window. The algorithm is built on the ensemble learning framework: whenever a new instance arrives, each component classifier and its weight are updated online, and minority class instances are randomly oversampled. Each component classifier's weight is determined by its G-mean performance on several recently arrived instances, where the G-mean is calculated incrementally using a time decay factor. At the same time, the algorithm periodically constructs a balanced dataset from the current data, trains a new candidate classifier on it, and selectively adds the candidate to the ensemble. Results on real and synthetic datasets show that the overall performance of the proposed algorithm is better than that of the baseline algorithms.

(3) Most classification algorithms for imbalanced data streams with concept drift consider only the two-class case. For the multi-class case, the thesis proposes a dynamically weighted data stream classification algorithm that uses a hybrid sampling mechanism. The algorithm is built on the ensemble learning framework and incrementally calculates the size of each class. When a new instance arrives, each base classifier and its weight are updated online. The weight of each base classifier is determined by its MGmean performance on the newest instance, while the learning frequency of each instance is determined by the ratio of the size of the largest class in the current data stream to the size of the class to which the instance belongs. In addition, the algorithm periodically uses hybrid sampling to construct multiple different datasets and trains multiple diverse candidate classifiers on them to improve the generalization ability of the algorithm. Experimental results show that the overall performance of the proposed algorithm is better than that of the baseline algorithms.
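
Two quantities mentioned in items (2) and (3) can be illustrated with a short Python sketch: a G-mean that is updated incrementally with a time decay factor, and a learning frequency computed as the ratio of the largest class size to the size of the instance's own class. This is only an illustration under stated assumptions, not the thesis's implementation: the names `DecayedGMean` and `learning_frequency`, the default decay value, and the choice to decay every class's counts on each arrival are all hypothetical.

```python
import math
from collections import defaultdict


class DecayedGMean:
    """Time-decayed G-mean over a stream (hypothetical helper, not from the thesis)."""

    def __init__(self, decay=0.9):
        # decay in (0, 1]: smaller values forget old predictions faster (assumed parameter).
        self.decay = decay
        self.correct = defaultdict(float)  # decayed count of correct predictions per true class
        self.total = defaultdict(float)    # decayed count of instances per true class

    def update(self, y_true, y_pred):
        # Assumption: fade the evidence of every class seen so far, then add the new instance.
        for c in self.total:
            self.correct[c] *= self.decay
            self.total[c] *= self.decay
        self.total[y_true] += 1.0
        if y_pred == y_true:
            self.correct[y_true] += 1.0

    def value(self):
        # G-mean = geometric mean of the per-class (decayed) recalls.
        recalls = [self.correct[c] / self.total[c] for c in self.total if self.total[c] > 0]
        if not recalls:
            return 0.0
        return math.prod(recalls) ** (1.0 / len(recalls))


def learning_frequency(class_counts, label):
    """Ratio of the largest class size to the size of the instance's class."""
    return max(class_counts.values()) / class_counts[label]


if __name__ == "__main__":
    tracker = DecayedGMean(decay=0.5)
    for y_true, y_pred in [(1, 1), (0, 1), (0, 0)]:
        tracker.update(y_true, y_pred)
    print(tracker.value())

    counts = {"a": 90, "b": 30, "c": 10}
    print({label: learning_frequency(counts, label) for label in counts})
```

Because the G-mean is a product of per-class recalls, a classifier that ignores the minority class scores zero regardless of its overall accuracy, which is why it suits imbalanced streams as a weighting criterion. The learning frequency grows as a class becomes rarer, so instances of small classes are presented to the learners more often.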
Keywords/Search Tags:Data stream mining, Concept drift, Ensemble learning, Class imbalance