
An Adaptive Classification Method For Data Stream Based On Active Learning

Posted on: 2021-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zhang
Full Text: PDF
GTID: 2428330614458397
Subject: Computer Science and Technology
Abstract/Summary:
Data streams contain rich information, and they have attracted a growing amount of research in recent years. Compared with static data, a data stream is characterized by high speed, unbounded length, concept drift, concept evolution, and scarcity of labels. Concept drift and concept evolution cause the classification ability of a data stream classification model to decline over time, so the model must be able to adapt itself. However, most existing adaptive data stream classification models assume that the true label of each incoming sample becomes available after its label has been predicted. This assumption is unreasonable in some cases, because labeling data tends to be costly and time-consuming. This thesis therefore studies data stream classification in a setting with label scarcity, concept drift, and concept evolution.

First, to address concept drift and label scarcity in data streams, an active learning method for the concept drift adaptation process is proposed. The method addresses the sampling bias that can arise during active learning by combining uncertainty-based active learning with the BOD (Boundary and Outlier Detection) method. It exploits two facts: uncertain samples reflect the decision boundary, while boundary points and outliers reflect the feature space. The aim of the combination is that the small number of samples actively chosen for true labeling reflects the true distribution of all the data as closely as possible.

Then, in addition to concept drift and label scarcity, the thesis considers the concept evolution problem and makes the following improvements to the existing classification model EMC (Evolving Micro-Clusters). Firstly, a distance-weighted strategy is proposed to improve the classification strategy of EMC. Secondly, an active learning method based on uncertainty and random selection is proposed so that the model can adapt to an environment in which labels are scarce. Thirdly, a new class detection method, NDLRD (Novelty Detection based on Local Relative Density), is proposed. NDLRD takes into account both the tendency of new-class samples to aggregate and the fact that new-class samples lie close to samples of existing classes. Local relative density is used to measure the degree to which a sample belongs to a new class.

For the first study, compared with the algorithms HAT (Hoeffding Adaptive Tree) and OBA (Oza Bag Adwin), which use 100% of the labels, the proposed algorithm lets the classification model maintain the same accuracy under concept drift while learning with about 20% of the labels on average. This demonstrates the good active learning ability of the proposed method. For the second study, experiments are carried out on 9 real and synthetic data sets. The improved model achieves better results on most of the data sets, which indicates its effectiveness.
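
As a rough illustration of what "active learning based on uncertainty and random selection" can look like on a stream, the sketch below decides, sample by sample, whether to request the true label. It is not the thesis's algorithm: the margin-based uncertainty measure, the `threshold`, `random_rate`, and `budget` parameters, and the function names are all assumptions introduced here for illustration.

```python
import random


def margin_uncertainty(probs):
    """Return 1 - (top1 - top2); values near 1 mean the classifier is unsure.

    The margin measure is an assumption; the abstract does not say which
    uncertainty measure the thesis uses.
    """
    ranked = sorted(probs, reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return 1.0 - (ranked[0] - runner_up)


def select_queries(stream_probs, threshold=0.7, random_rate=0.05,
                   budget=0.2, seed=0):
    """Return indices of stream samples whose true label would be requested.

    stream_probs: iterable of predicted class-probability lists, one per sample.
    threshold, random_rate, budget: hypothetical tuning parameters.
    """
    rng = random.Random(seed)
    queried = []
    for i, probs in enumerate(stream_probs):
        seen = i + 1
        # Skip the sample if the labeled fraction so far has reached the budget.
        if len(queried) / seen >= budget:
            continue
        # Random selection: occasionally query regardless of uncertainty,
        # so regions the classifier is (wrongly) confident about get sampled.
        pick_randomly = rng.random() < random_rate
        # Uncertainty selection: query samples near the decision boundary.
        if pick_randomly or margin_uncertainty(probs) >= threshold:
            queried.append(i)
    return queried


# Five confident predictions followed by one maximally uncertain prediction.
stream = [[0.99, 0.01]] * 5 + [[0.5, 0.5]]
print(select_queries(stream, random_rate=0.0, budget=0.5))  # [5]
```

The random component is what lets such a strategy notice drift in regions where the classifier is still confident; pure uncertainty sampling would never query there.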
Keywords/Search Tags:data stream, concept drift, novel class detection, active learning, adaptive classification