Research On Semi-supervised Classification Of Data Stream Based On Clustering

Posted on: 2021-08-21  Degree: Master  Type: Thesis
Country: China  Candidate: S Liu  Full Text: PDF
GTID: 2518306554966129  Subject: Master of Engineering
Abstract/Summary:
Most traditional machine learning algorithms run in a static, closed environment and assume that the data distribution stays the same while the algorithm executes. In many practical applications, however, large amounts of data arrive as a high-speed, unbounded stream, and the distribution of that data changes continually because of factors such as equipment wear and environmental change, resulting in concept drift. Data that is generated and changes in this way poses a great challenge to traditional static data mining. Data stream mining emerged in this context, and data stream classification plays a vital role in large-scale real-time data processing. Because labeling is costly, the volume of data is large, and data is generated quickly, labeling all of the data is expensive and impractical. Semi-supervised data stream classification studies how to use a small set of labeled samples together with a large number of unlabeled samples to detect changes in the data distribution and to train and update models.
Semi-supervised classification of data streams is therefore closer to real application scenarios and has considerable practical value. At the same time, the semi-supervised setting brings new challenges to data stream classification: 1) a model trained on a small number of labeled samples generalizes relatively poorly, so the question is how to use the internal structure and distribution of a large number of unlabeled samples to assist model training and updating; 2) accuracy-based concept drift detection methods do not adapt well to the semi-supervised setting, so the question is how to use both labeled and unlabeled samples to detect changes in the data distribution and to adapt to the dynamic data stream environment through model updates. Considering the practical value of semi-supervised data stream classification and the new challenges brought by the small number of labeled samples, this thesis conducts research from two aspects:
(1) Existing research on data stream classification focuses mainly on supervised learning, while semi-supervised classification of data streams has not yet attracted enough attention. Therefore, based on a comprehensive collection of work on semi-supervised classification of data streams, this thesis sorts the existing algorithms into several types from several aspects, and describes and summarizes more than 40 existing algorithms according to the type of classifier they use and the concept drift detection method they employ. Several semi-supervised data stream classification algorithms are then compared and analyzed from many aspects on widely used real and synthetic datasets. Finally, this thesis raises some issues that deserve further discussion in future work on semi-supervised classification of data streams.
(2) Because clustering algorithms can capture the inherent structure and distribution of data, many studies have applied clustering to semi-supervised classification of data streams. However, the existing algorithms do not take the local structure information of the samples into account during concept drift detection, and so cannot accurately detect new concepts and reoccurring concepts; in addition, cluster-based classifiers cannot be incrementally updated with data batches from the same concept to improve their generalization ability. Therefore, this thesis proposes a semi-supervised classification algorithm based on BIRCH ensemble and local structure mapping (SCBELS). Specifically, the semi-supervised Bayesian method is combined with the local structure mapping strategy from transfer learning to calculate the local similarity between each sample and each classifier, which is used to detect concept drift; when the algorithm detects a reoccurring concept, the corresponding BIRCH ensemble classifier is incrementally updated to improve the generalization ability of the model. Extensive comparative experiments verify the advantages of the SCBELS algorithm from many aspects.
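The general idea in (2) can be illustrated with a minimal Python sketch. It is not the SCBELS algorithm: the abstract gives no implementation details, so this sketch uses no BIRCH CF-tree and no semi-supervised Bayesian method. Instead it assumes simple radius-based micro-clusters as a stand-in for a cluster-based classifier, and uses the fraction of a batch that falls inside existing clusters as a crude stand-in for local similarity. All names (`ClusterClassifier`, `process_batch`, `radius`, `threshold`) are hypothetical.

```python
import math
from collections import Counter

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

class ClusterClassifier:
    """Toy cluster-then-label model (assumed stand-in for one ensemble member)."""

    def __init__(self, radius):
        self.radius = radius
        self.centroids = []  # running mean of each micro-cluster
        self.counts = []     # samples absorbed by each micro-cluster
        self.labels = []     # Counter of labels seen in each micro-cluster

    def _nearest(self, x):
        return min(range(len(self.centroids)),
                   key=lambda i: dist(self.centroids[i], x))

    def partial_fit(self, X, y):
        """Absorb a batch incrementally; y[i] is None for an unlabeled sample."""
        for x, label in zip(X, y):
            i = self._nearest(x) if self.centroids else None
            if i is None or dist(self.centroids[i], x) > self.radius:
                # No cluster is close enough: open a new micro-cluster.
                self.centroids.append(tuple(x))
                self.counts.append(0)
                self.labels.append(Counter())
                i = len(self.centroids) - 1
            self.counts[i] += 1
            n = self.counts[i]
            # Incremental mean update of the centroid.
            self.centroids[i] = tuple(
                c + (v - c) / n for c, v in zip(self.centroids[i], x))
            if label is not None:
                self.labels[i][label] += 1

    def predict(self, x):
        """Majority label of the nearest micro-cluster that has any label."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dist(self.centroids[i], x))
        for i in order:
            if self.labels[i]:
                return self.labels[i].most_common(1)[0][0]
        return None

    def local_similarity(self, X):
        """Fraction of the batch lying within `radius` of an existing cluster."""
        if not self.centroids or not X:
            return 0.0
        hits = sum(
            1 for x in X
            if dist(self.centroids[self._nearest(x)], x) <= self.radius)
        return hits / len(X)

def process_batch(ensemble, X, y, radius=1.0, threshold=0.8):
    """Update the best-matching model, or add a new one when drift is flagged."""
    scores = [clf.local_similarity(X) for clf in ensemble]
    if scores and max(scores) >= threshold:
        # Batch resembles a known concept: update that model incrementally.
        clf, drift = ensemble[scores.index(max(scores))], False
    else:
        # No model covers the batch: treat it as a new concept.
        clf, drift = ClusterClassifier(radius), True
        ensemble.append(clf)
    clf.partial_fit(X, y)
    return clf, drift
```

In this sketch, unlabeled samples still shape the clusters, and a batch that matches an existing model updates that model rather than replacing it, which mirrors the incremental update on a reoccurring concept described above. The `radius` and `threshold` defaults are arbitrary illustration values.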
Keywords/Search Tags: concept drift, data stream, clustering, ensemble learning, local structure mapping