Research On Classification For Data Streams With Concept Drift

Posted on: 2020-11-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y G Sun
Full Text: PDF
GTID: 1368330578976882
Subject: Computer Science and Technology
Abstract/Summary:
Data stream classification is an important research topic in data mining. It deals with data sequences that are generated continuously in the form of streams. For example, data arrive in fast, real-time, continuous form in applications such as sensor network anomaly detection, credit card fraud monitoring, weather forecasting, and electricity price forecasting. The key issue in data stream classification is concept drift, i.e., the underlying distribution of the data may change over time. Data streams are therefore inherently non-stationary, which greatly affects both the performance and the update cost of a classification model. In a data stream environment, a concept may also reappear after a period of time. This phenomenon, known as the recurring concept problem, causes algorithms to repeatedly train new models on the same concept, which wastes training time and can even reduce overall performance. In addition, class imbalance and multi-label data also affect model performance: the former requires the model to predict minority-class instances more accurately, while the latter requires the model to predict the labelsets of instances accurately. To meet these challenges, the dissertation analyzes the characteristics of data streams and the related theory and proposes four novel classification algorithms for data streams with concept drift. The main contributions are summarized as follows.

(1) An Adaptive Windowing Detection based Ensemble (AWDE) is proposed. Unlike traditional ensemble algorithms, AWDE uses adaptive sliding window detection to construct the training data for each base classifier. AWDE proceeds as follows. First, adaptive sliding window detection is used to capture concept drift explicitly. Second, the same detection method is used to construct the training data adaptively for each base classifier, which removes the dependence on block size; building on data block integration, each base classifier is constructed from a selected part of the data in the sliding window. Finally, an ensemble based on both accuracy and diversity is used to improve the generalization ability of the classifier. Theoretical analysis and experimental results show that AWDE handles different types of concept drift effectively and reduces the training time and memory consumption of the model while maintaining high accuracy.

(2) A Recurrent Detection and Prediction (RDP) algorithm based on a concept transfer graph model is introduced. Unlike traditional algorithms, each node in the graph model stores a base classifier (a historical concept), and the edge weights reflect the recurrence of concepts. In the learning phase, a change detection method based on Jensen-Shannon divergence is used to detect concept drift and recurring concepts and to guide the updating of the graph model. In the prediction phase, either a single classifier or an ensemble method is used to predict unknown instances, according to the established graph model. In addition, to speed up learning of the concept transfer graph model and reduce its storage space, a feature selection method based on symmetric uncertainty is used to preprocess the data. Experiments on both synthetic and real-world datasets show that RDP performs significantly better than state-of-the-art algorithms, especially when concepts reappear.

(3) A novel Two-Stage Cost-Sensitive classification (TSCS) algorithm is proposed. Unlike the methods above, TSCS targets the class imbalance problem in data streams with concept drift. Its training process has two steps: a preprocessing step and a learning step. In the preprocessing step, TSCS adopts a cost-sensitive strategy to select a feature subset space that effectively balances the class distribution. For prediction, TSCS uses weighted aggregation based on a double cost-sensitive metric to predict the labels of instances. Compared with existing algorithms, TSCS achieves better classification performance on both synthetic and real-world data streams with class imbalance and concept drift.

(4) An ensemble classifier based on random labelsets (LPLDC) is proposed for multi-label data streams. During training, the labelset is divided into several smaller disjoint label subsets, and the probabilistic classifier chain method is applied to each subset. When concept drift occurs, the weights are updated according to the performance of each base classifier on the latest data block, and a dynamic weighting strategy is used for prediction. In addition, the adaptive sliding window detection algorithm is built into LPLDC to deal with concept drift. Experimental results show that LPLDC predicts the labelsets of instances more effectively on most of the datasets and is better suited to environments with concept drift.

To address the pressing open problems in data stream classification, the dissertation proposes a series of solutions for building a more effective learning model for data streams with concept drift. The proposed schemes reduce time and space overhead while maintaining classification efficiency and improve adaptability to concept drift, thus providing new research ideas and a theoretical basis for both the theoretical study and the practical application of concept drift.
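
The abstract mentions a change detection method based on Jensen-Shannon divergence but gives no details. As a rough illustration of the general idea only, the following Python sketch compares the empirical distribution of a categorical attribute in a reference window with that in the current window and flags a change when the divergence exceeds a threshold. This is not the dissertation's RDP detector: the function names, the use of simple category histograms, the fixed windows, and the threshold value of 0.1 are all assumptions made here for illustration.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions given as equal-length lists of probabilities.
    The result lies in [0, 1]."""
    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the divisions inside kl() are always safe.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def histogram(window, categories):
    """Relative frequency of each category in a non-empty window."""
    counts = Counter(window)
    n = len(window)
    return [counts[c] / n for c in categories]


def drift_detected(reference, current, categories, threshold=0.1):
    """Hypothetical helper: report a change when the divergence between
    the two windows exceeds an (assumed) threshold."""
    p = histogram(reference, categories)
    q = histogram(current, categories)
    return js_divergence(p, q) > threshold


if __name__ == "__main__":
    cats = ["a", "b"]
    old = ["a"] * 8 + ["b"] * 2   # mostly "a"
    new = ["a"] * 2 + ["b"] * 8   # mostly "b"
    print(drift_detected(old, old, cats))
    print(drift_detected(old, new, cats))
```

In this sketch the two windows are fixed in advance. A stream learner would instead slide or resize the windows as instances arrive, and a low divergence against a stored historical distribution could serve as a hint that an earlier concept has reappeared.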
Keywords/Search Tags: Data streams, Classification, Ensemble Classification, Concept Drift, Concept Recurring, Class Imbalance, Multi-label