Research On Semi-supervised Classification Of Data Stream Based On Adaptive Density Clustering

Posted on: 2022-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: C J Liu
GTID: 2518306554971139
Subject: Master of Engineering

Abstract/Summary:
In today's era of big data, massive amounts of data are continuously generated by various hardware and software. These data are high-speed and infinite, and they are generated and arrive in the form of streams. At the same time, the distribution of the data changes as the data itself or the external environment changes, so data streams in real scenarios exhibit concept drift. Traditional data mining requires the data to be processed in batches and to be independent and identically distributed. Traditional static data mining is therefore not suitable for data streams that are high-speed, infinite, and subject to concept drift, and dynamic data stream mining has been introduced to solve this problem.

Because large numbers of unlabeled instances appear in real-world data streams, data stream mining faces a new challenge: how to classify data streams well when only a few instances are labeled. Labeling every instance while the stream is generated at high speed is impractical, time-consuming, and laborious, yet the unlabeled instances contain a great deal of useful information. Blindly discarding them leaves the trained model with insufficient generalization ability, which motivates the use of semi-supervised learning. Classifying concept-drifting data streams in semi-supervised scenarios faces two main challenges: (1) how to train a classification model that generalizes well and learns online from a small number of labeled instances and a large number of unlabeled instances; and (2) how to detect concept drift accurately in a semi-supervised environment, so that changes in the data distribution are reflected promptly and the classification model can be adjusted in time to fit the current distribution, obtaining better classification accuracy through model updates. Considering the research value and new challenges of classifying concept-drifting data streams in a semi-supervised environment, the research contents of this thesis are summarized as follows.

First, an online learning framework based on the very fast decision tree can adapt to concept drift by pruning the model while accommodating the dynamic characteristics of data streams. SSCADP (Semi-supervised classification of data streams based on adaptive density peak clustering) is proposed by applying an adaptive cluster-center location algorithm to the very fast decision tree algorithm. In the classification stage, the decision tree assigns a predicted label to each sample. In the learning stage, the classified samples are processed in turn and each falls into its corresponding leaf node. When a detection period is reached, the adaptive cluster-center location algorithm is invoked to cluster the samples in the leaf nodes, and majority voting is used to label the unlabeled samples in each cluster. If concept drift is detected, the model is pruned. If the number of samples in a leaf node reaches the specified threshold, the optimal splitting attribute is selected based on the Hoeffding inequality and the leaf node is split further. For concept drift detection, the algorithm considers that changes in high-density samples are more likely to reflect changes in the data distribution, and on this basis improves the drift detection method of the SUN (Learning from concept drifting data streams with unlabeled data) algorithm, which judges the magnitude of change in cluster distance. Extensive experiments verify the advantages of the SSCADP algorithm.

Second, concept drift detection in traditional supervised settings mostly uses accuracy as the measurement index, and the large number of unlabeled samples in a semi-supervised environment makes explicit, accuracy-based drift detection highly uncertain. This thesis therefore adopts an implicit strategy for adapting to concept drift and proposes the S2CDTL (Semi-supervised classification of data stream with concept drift via transfer learning perspective) algorithm. The algorithm dynamically maintains a pool of ensemble classifiers and uses a cluster classifier as the base classifier. The cluster classifier is trained with the adaptive cluster-center location algorithm proposed in the first part of the work. Initially, the classifier trained on the first data block is added directly to the ensemble pool. When a data block to be classified arrives, the transferred classifiers, combined with the classifier trained on the most recent data block, predict labels by majority vote. The model transfer strategy uses the samples in the current data block to update the ensemble model from the previous moment, thereby adapting to concept drift implicitly. When the number of classifiers in the pool reaches the specified threshold, the ensemble is updated according to a diversity-maximizing policy. Extensive comparative experiments show that the algorithm performs well on most data sets.
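
Two steps in the SSCADP description can be illustrated with a short sketch: the Hoeffding-bound test that decides when a leaf node is split, and the majority vote that labels the unlabeled samples in each cluster. This is a minimal illustration under stated assumptions, not the thesis's implementation. The function names, the parameters (`value_range`, `delta`), and the convention of marking unlabeled samples with `None` are hypothetical. Cluster assignments are taken as given, so the adaptive cluster-center location algorithm itself is not reproduced.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within epsilon of the mean observed
    over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split a leaf when the gap between the two best attributes' gains
    exceeds the Hoeffding bound (the usual very fast decision tree test)."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def label_by_cluster_majority(cluster_ids, labels):
    """Assign each unlabeled sample (label None) the majority label among the
    labeled samples of its cluster. Samples in clusters that contain no
    labeled sample stay unlabeled (an assumption of this sketch)."""
    votes = {}
    for cid, lab in zip(cluster_ids, labels):
        if lab is not None:
            votes.setdefault(cid, Counter())[lab] += 1

    result = []
    for cid, lab in zip(cluster_ids, labels):
        if lab is None and cid in votes:
            result.append(votes[cid].most_common(1)[0][0])
        else:
            result.append(lab)
    return result


if __name__ == "__main__":
    # Hypothetical leaf: six samples in three clusters, three of them unlabeled.
    cluster_ids = [0, 0, 0, 1, 1, 2]
    labels = ["a", None, "a", "b", None, None]
    print(label_by_cluster_majority(cluster_ids, labels))

    # Hypothetical gains for the two best attributes after 1000 samples.
    print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

The bound shrinks as the number of samples `n` in a leaf grows, so a small but consistent advantage of one attribute eventually justifies a split. How the thesis handles ties or clusters without labeled samples is not stated in the abstract, so the choices above are placeholders.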
Keywords/Search Tags: Clustering, Decision tree, Ensemble learning, Semi-supervised classification, Concept drift