Research on Semi-supervised Classification Algorithm for Data Stream with Concept Drift

Posted on: 2020-01-17 | Degree: Master | Type: Thesis
Country: China | Candidate: K K Qin
GTID: 2428330599459739 | Subject: Computer Science and Technology
Abstract/Summary:
In many current application scenarios, data arrives in the form of data streams, which has opened a new research direction: data stream machine learning. Because of the characteristics of data streams (real-time arrival, high speed, large volume, variability, etc.), data stream machine learning poses new challenges for traditional machine learning tasks, especially classification. Existing work focuses mainly on data stream classification in supervised settings and data stream clustering in unsupervised settings. By comparison, research on data stream classification in semi-supervised settings is very rare, and no review article exists yet. In practice, however, acquiring data labels is time-consuming and laborious, and the volume and speed of a data stream make it almost impossible to label all of the data correctly and in time. For example, in online credit card fraud detection, when a new transaction occurs, the current classifier predicts whether it is normal or fraudulent. When the customer receives the bank statement, he or she checks whether the prediction was accurate and reports back to the bank, so that the bank obtains the true type of the transaction. However, not all customers provide feedback, and the feedback arrives with a delay, so the classification model is usually updated in a semi-supervised setting. Studying the classification of data streams with concept drift in a semi-supervised setting is therefore more realistic and more meaningful.

This study faces two main challenges: 1) how to construct a classification model with good generalization ability and update it continuously in a semi-supervised setting, where only a small number of randomly selected samples are labeled; and 2) how to detect concept drift effectively in a semi-supervised setting and how to adjust the classification model effectively once drift is detected.

The main research contents of this thesis are as follows.

First, data stream classification is briefly summarized, and current research on the classification of data streams with concept drift in semi-supervised settings is reviewed comprehensively and thoroughly.

Second, the pool update process of the SPASC algorithm has a weakness: once the classifier pool is full, the original update strategy adapts poorly to concept drift. To address this, the SSCLCR algorithm is proposed, which improves the pool update through a "local component replacement" strategy. A series of experiments shows that this method updates the classifier pool more effectively and thereby improves classification accuracy. "Local component replacement" is also a viable solution to the overlap between different concept classes caused by concept drift.

Third, current cluster-based algorithms for classifying data streams with concept drift usually fix the number of clusters in advance when constructing the classifier and keep it unchanged while the stream is processed, which is unreasonable in a data stream setting. Moreover, the number of clusters strongly affects the accuracy of the algorithm, yet there is no uniform standard for choosing it. To address these problems, this thesis proposes the S²CD-TL algorithm. The main work includes: i. a CUMSUM-type algorithm, based on the map, is proposed to estimate the number of clusters, and a cluster-based classifier is trained on that estimate; ii. a classifier culling strategy based on maximum diversity is proposed to update the classifier pool; iii. an ensemble learning weighting strategy based on a transfer perspective is proposed for classifying data. Compared with baseline techniques, the proposed algorithm achieves higher classification accuracy at slightly higher time complexity.

Fourth, existing chunk-based processing algorithms are better suited to periodic concept drift and do not perform well in more complex concept drift scenarios. Inspired by the human memory storage model, we propose OLFLSSL, an algorithm based on an online and offline storage model combined with streaming KNN for classification. The algorithm learns the data stream online through a hierarchical index structure. At intervals, the concept drift processing mechanism is triggered to extract knowledge from the leaf nodes and clear the samples stored in those nodes. The offline module is then updated based on concept drift detection and the extracted knowledge. Experimental results show that the proposed algorithm achieves higher or at least comparable accuracy relative to the baseline model and adapts better to complex concept drift scenarios.

The innovations of this thesis are as follows: 1) to address the shortcomings of the SPASC algorithm, SSCLCR exploits the characteristics of cluster-based classifiers and introduces the concept of local component replacement to update the classifier; 2) S²CD-TL uses a CUMSUM-type unsupervised method to estimate the number of clusters and proposes a classifier culling strategy and a weighted ensemble classification method based on cluster-based classifiers, used for pool update and classification respectively; 3) OLFLSSL designs a storage model that combines online and offline components to learn from the data stream in real time.
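
The semi-supervised setting described in the abstract (predict every incoming sample, but receive a true label for only a small random fraction) can be illustrated with a short test-then-train loop. The sketch below is not SSCLCR, S²CD-TL, or OLFLSSL; it is a minimal sliding-window k-NN written only to show the setting. The class `SlidingWindowKNN`, the functions `run_stream` and `make_stream`, the 20% label rate, and the synthetic two-feature stream with one abrupt drift are all assumptions made for illustration.

```python
from collections import Counter, deque
import math
import random


class SlidingWindowKNN:
    """Hypothetical k-NN that keeps only the most recent labeled samples."""

    def __init__(self, k=3, window=50):
        self.k = k
        # A bounded deque drops the oldest labeled samples automatically,
        # which is a crude way of forgetting an outdated concept.
        self.window = deque(maxlen=window)

    def predict(self, x, default=0):
        if not self.window:
            return default  # nothing learned yet
        nearest = sorted(self.window, key=lambda s: math.dist(s[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def learn(self, x, y):
        self.window.append((x, y))


def make_stream(n=400, drift_at=200, seed=1):
    """Synthetic stream (assumption): the labeling rule flips at `drift_at`."""
    rng = random.Random(seed)
    for t in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] > 0.5)
        if t >= drift_at:
            y = 1 - y  # abrupt concept drift
        yield x, y


def run_stream(stream, model, label_rate=0.2, seed=0):
    """Test-then-train: predict every sample, learn only from labeled ones."""
    rng = random.Random(seed)
    correct = n = n_labeled = 0
    for x, y in stream:
        correct += model.predict(x) == y  # evaluate before any update
        n += 1
        if rng.random() < label_rate:  # only some labels are ever revealed
            model.learn(x, y)
            n_labeled += 1
    return correct / max(n, 1), n_labeled


if __name__ == "__main__":
    accuracy, labeled = run_stream(make_stream(), SlidingWindowKNN())
    print(f"accuracy={accuracy:.3f}, labeled samples used={labeled}")
```

Because the window holds only recent labeled samples, the model recovers from the drift only as fast as new labels arrive. That dependence on scarce labels is the difficulty the two challenges in the abstract describe.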
Keywords/Search Tags: concept drift data stream, semi-supervised classification, ensemble learning, cluster-based model