
Online Concept Drift Detection Based On Data-windows

Posted on: 2015-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: M Liu
Full Text: PDF
GTID: 2298330434956275
Subject: Control Science and Engineering
Abstract/Summary:
With the development of the information age, fields such as network security, stock analysis, meteorological monitoring, and credit-card fraud detection generate large quantities of data that cannot be stored over the long term. Such dynamic data, which grows without end as time goes by, is called a data stream. Analyzing and processing a data stream is constrained by storage capacity and computing speed. At the same time, abundant and valuable knowledge is hidden inside the data stream, and this hidden knowledge may change dynamically as the environment changes and time passes, a phenomenon called concept drift. Data in a stream is fast, real-time, unbounded, wide-ranging, and dynamically changing, and some concept drifts are periodic. As a result, models built on an initial dataset adapt poorly to the changing distribution, which poses great challenges for stream data analysis. How to detect concept drift in stream data accurately and promptly, and how to adapt to it, have become hot and difficult problems in machine learning and data mining. This thesis focuses on the detection of concept drift in dynamic stream data. The main contributions are as follows:

(1) Recent research on detecting concept drift in stream data is summarized, and the merits and defects of existing drift detection algorithms are analyzed.

(2) To discover concept drifts of various types, and the exact positions at which they occur, accurately and quickly, an online concept drift detection method based on overlapped data windows is proposed. The method calculates the heterogeneous Euclidean distance between neighboring overlapped data windows and adopts the K-nearest neighbor (KNN) principle to measure the degree of inconsistency among the samples in the data windows, so that the difference between distributions can be evaluated and concept drift detected. To assess the validity of the method, tests were carried out on public datasets with varying drift severity and drift speed. The results show that overlapped data windows detect the occurrence of concept drift more promptly and accurately than non-overlapped data windows.

(3) A detection method based on Canonical Correlation Analysis with self-adaptive data windows is studied. The method treats the incoming data stream as a series of rectangular windows and evaluates the difference between distributions through Singular Value Decomposition and Canonical Correlation Analysis. In addition, it can adjust the data windows to suit the detection of concept drifts of various types. Experiments on artificial drift datasets with varying drift severity and drift speed show that the algorithm can detect both gradual drift and abrupt shift. Experiments on a semi-artificial shift dataset confirm that the adaptive data windows perform better than fixed data windows. Finally, the algorithm was applied to drift detection on a real dataset, namely the power supply dataset from an Italian power company.

(4) Apart from drift detection, another basic problem in handling data streams with concept drift is model adaptation. This thesis adopts diversity strategies for the component classifiers of online ensemble learning algorithms to implement adaptive model learning, and evaluates their feasibility by experiment.
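The overlapped-window idea in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact distance or inconsistency formula, so the sketch assumes the common Heterogeneous Euclidean-Overlap form (range-normalized difference for numeric attributes, 0/1 mismatch for nominal ones) and scores each new window by the mean distance from its samples to their k nearest neighbors in the previous window. The names `heom`, `knn_inconsistency`, `overlapped_windows`, and `drift_scores`, the window size, the step, and the toy stream are all hypothetical.

```python
import math

def heom(a, b, ranges):
    """Heterogeneous distance between two mixed-type samples (assumed form).
    ranges[j] is (max - min) for a numeric attribute, or None for a nominal one."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            d = 0.0 if x == y else 1.0           # nominal: overlap metric
        else:
            d = abs(x - y) / r if r > 0 else 0.0  # numeric: range-normalized
        total += d * d
    return math.sqrt(total)

def knn_inconsistency(old, new, ranges, k=3):
    """Mean distance from each sample in `new` to its k nearest neighbors in `old`.
    This score is an illustrative stand-in for the thesis's inconsistency measure."""
    per_sample = []
    for s in new:
        dists = sorted(heom(s, o, ranges) for o in old)
        per_sample.append(sum(dists[:k]) / k)
    return sum(per_sample) / len(per_sample)

def overlapped_windows(stream, size, step):
    """Yield (start, window) pairs; windows overlap whenever step < size."""
    for start in range(0, len(stream) - size + 1, step):
        yield start, stream[start:start + size]

def drift_scores(stream, ranges, size=20, step=5, k=3):
    """Score every window against the overlapping window just before it."""
    wins = list(overlapped_windows(stream, size, step))
    return [
        (wins[i + 1][0], knn_inconsistency(wins[i][1], wins[i + 1][1], ranges, k))
        for i in range(len(wins) - 1)
    ]

# Toy stream (hypothetical): the concept changes abruptly at index 40.
stream = [((i % 5) * 0.1, "a") for i in range(40)]
stream += [(5.0 + (i % 5) * 0.1, "b") for i in range(40)]
ranges = (5.4, None)  # numeric attribute spans 0.0..5.4; second attribute is nominal

scores = drift_scores(stream, ranges)
peak_start, peak_score = max(scores, key=lambda t: t[1])
print(peak_start, round(peak_score, 3))
```

On this toy stream the largest score belongs to the window starting at index 25, the first window that reaches past the change point at index 40. With a step equal to the window size (no overlap), the change would only be scored once a whole new window had arrived, which is the kind of delay the overlapped scheme is meant to reduce.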
Keywords/Search Tags: data stream, concept drift, heterogeneous Euclidean distance, KNN nearest algorithm, Canonical Correlation