
Research On Semi-supervised Data Stream Classification Method Based On Ensemble Model

Posted on: 2022-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zheng
Full Text: PDF
GTID: 2518306560455654
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and big data technology, more and more real-world applications in daily life, such as news retrieval, Taobao shopping, and bank transactions, generate massive amounts of streaming data. Unlike the static data used in traditional data mining tasks, these data streams have many new characteristics, such as high volume, high speed, hidden concept drift, and concept evolution. They may also be multi-label, which aggravates the problems of label imbalance and label noise. As a result, data stream classification faces unprecedented challenges, and mining the potentially valuable information in a data stream efficiently and accurately has become an important task. This dissertation uses semi-supervised classification models to address a series of problems in data stream classification, such as the lack of label information. The main contributions are as follows.

(1) To handle insufficient label information and concept evolution in real data streams, a semi-supervised classification algorithm for single-label data streams is proposed. The method uses a small amount of labeled data to construct a semi-supervised classification model. To detect concept evolution, it uses the properties of class clusters, namely compactness within clusters and sparseness between clusters, to decide whether an instance belongs to a novel class. To handle hidden recurring concept drift, it first uses a detection mechanism to track significant changes in a window of confidence scores, and then calculates the distance between the distributions before and after the drift to confirm whether the drift is recurring. Extensive experiments show that, compared with classic data stream classification methods, the proposed method not only achieves higher classification accuracy but also effectively detects the recurring concept drift and concept evolution hidden in single-label data streams.

(2) To deal with concept drift, class label imbalance, and label noise, which are aggravated in multi-label data streams, a semi-supervised classification algorithm for multi-label data streams is proposed. The method likewise uses a small amount of labeled data to construct a classification model. To adapt to multiple types of concept drift (namely, heterogeneous concept drift) in multi-label data streams, it adopts a self-adjusting sliding window mechanism. To handle label noise and class imbalance, it adopts an error punishment mechanism that removes from the window, as early as possible, the data polluted by label noise and the data that cause class imbalance. Extensive experiments show that, compared with classic multi-label classification methods and multi-label data stream classification methods, the proposed method adapts well to heterogeneous concept drift, label noise, and class imbalance while maintaining better classification accuracy under various data conditions.
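
The two-step recurring-drift check in contribution (1) can be illustrated with a small sketch. This is not the dissertation's algorithm: the abstract does not name the change detector or the distance measure, so the sketch assumes a simple comparison of mean confidence between the two halves of the window and a Hellinger distance between histograms. The class name `ConfidenceDriftMonitor`, the helper functions, and all thresholds are hypothetical.

```python
import math
from collections import deque


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


def histogram(values, bins=5, lo=0.0, hi=1.0):
    """Normalized histogram of values assumed to lie in [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


class ConfidenceDriftMonitor:
    """Hypothetical monitor: flags drift, then labels it as new or recurring."""

    def __init__(self, window=10, drop=0.2, recur_dist=0.3):
        self.scores = deque(maxlen=window)  # sliding window of confidence scores
        self.drop = drop                    # assumed threshold on the mean drop
        self.recur_dist = recur_dist        # assumed threshold on the distance
        self.past_concepts = []             # histograms of earlier concepts

    def add(self, score):
        """Add one confidence score; return True if a significant drop is seen."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence yet
        s = list(self.scores)
        half = len(s) // 2
        old_mean = sum(s[:half]) / half
        new_mean = sum(s[half:]) / (len(s) - half)
        return old_mean - new_mean > self.drop

    def classify_drift(self, before, after):
        """Compare post-drift data with stored concepts, then store the old one."""
        after_hist = histogram(after)
        recurring = any(hellinger(after_hist, h) < self.recur_dist
                        for h in self.past_concepts)
        self.past_concepts.append(histogram(before))
        return "recurring" if recurring else "new"
```

In this sketch, `add` would be called once per classified instance, and `classify_drift` only when `add` returns True, with `before` and `after` holding a one-dimensional feature (scaled to [0, 1]) sampled on each side of the detected change. A real system would need a multivariate distance and a statistically grounded change detector.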
Keywords/Search Tags: Data stream classification, Concept drift, Concept evolution, Semi-supervised classification