Concept Drifting Detection And Classification On Data Streams

Posted on: 2013-03-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P P Li
GTID: 1268330398975894
Subject: Computer application technology
Abstract:
With the rapid development and broad application of information technologies, streaming data such as supermarket transactions, Internet search requests and telephone call records have become ubiquitous. These streams are rich in valuable knowledge, but they are continuous, high-volume, open-ended, and subject to concept drift and missing labels, all of which challenge traditional mining algorithms. Mining concept drifting data streams with unlabeled data is therefore a significant problem in real-world applications.

In this thesis, we focus on the classification of data streams with concept drifts and unlabeled data, from three perspectives. First, we explore new efficient and effective classification models for noisy data streams. Second, with the proposed classification models, we detect various concept drifts in noisy data streams. Third, we propose classification methods for concept drifting data streams with unlabeled data. The main contributions are as follows.

1) Unlike the traditional data sources for data mining, streaming data in the real world are continuous, high-volume, open-ended, and concept drifting with missing labels. This poses a significant challenge for traditional mining models (such as decision trees, neural networks and SVMs) in terms of prediction accuracy and time/space overheads. Motivated by this, we develop new classification models based on random decision trees to handle data streams. For variants of random decision trees, we propose a series of data stream classification methods called ERDT (Ensembling Random Decision Trees). Extensive experiments show that our methods are superior to state-of-the-art streaming classification methods in classification accuracy and in time and space overheads.

2) To detect various concept drifts and reduce the impact of noise in real-world data stream applications, we present a series of classification methods based on Ensembling Random Decision Trees for Concept drifting data streams (called ERDTC). We also develop a new lightweight classification algorithm for Concept Drifting detection by means of an ensemble model of complete Random Decision Trees (named CDRDT). Extensive studies on synthetic streaming data demonstrate that, compared to several well-known classification methods, our proposed methods detect concept drifts in noisy streaming data effectively and efficiently.

3) Learning from concept drifting data streams with unlabeled data in the real world is also a challenge. With this motivation, we propose a Semi-supervised classification algorithm for data Streams with concept drifts and UNlabeled data (called SUN). In SUN, we develop a clustering algorithm derived from k-Modes to produce concept clusters at the leaves of an incremental decision tree. By comparing deviations between historical concept clusters and new ones, we distinguish potential concept drifts from noise in data streams. Experiments demonstrate that SUN adapts particularly well to abrupt concept drifts and sampling changes in data streams. It is also comparable to several state-of-the-art online supervised and semi-supervised algorithms.

4) To track recurring concept drifts in a data stream environment with unlabeled data, we propose a Semi-supervised classification algorithm for data streams with REcurring concept Drifts and Limited LAbeled data, called REDLLA, in which a decision tree is adopted as the classification model. While the tree is growing, a clustering algorithm based on k-Means is applied to produce concept clusters, and unlabeled data are labeled by the majority class at the leaves. Based on deviations between historical and new concept clusters, REDLLA can distinguish potential concept drifts and maintain recurring concepts. Extensive experiments confirm that REDLLA outperforms several state-of-the-art online classification algorithms and online semi-supervised algorithms in classification accuracy and time overhead.

5) Lastly, we apply our classification methods for data streams with concept drifts and unlabeled data to two sets of real-world streaming data: the Yahoo shopping data and the electricity market data. Extensive studies reveal the advantages of our methods over several state-of-the-art online classification algorithms and well-known online semi-supervised algorithms. The experiments also show that our methods adapt to concept drifting data streams and perform better in classification accuracy and in time and space overheads.
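
The two leaf-level steps described in contributions 3) and 4), labeling unlabeled data by the majority class and comparing historical concept clusters with new ones, can be illustrated with a minimal Python sketch. The abstract does not state how the deviation between clusters is measured, so the Euclidean distance between cluster centroids and the fixed threshold used here are assumptions made only for illustration, and the function names (majority_label, centroid, drift_suspected) are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from math import dist


def majority_label(labels):
    """Return the most common label among the labeled instances at a leaf."""
    return Counter(labels).most_common(1)[0][0]


def centroid(points):
    """Mean vector of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def drift_suspected(history_points, new_points, threshold):
    """Flag a potential drift when the two cluster centroids are far apart.

    Assumption: deviation is the Euclidean distance between centroids and
    `threshold` is a user-chosen constant; the thesis may define both
    differently.
    """
    deviation = dist(centroid(history_points), centroid(new_points))
    return deviation > threshold


# Toy data: one historical cluster, one similar batch, one shifted batch.
history = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2)]
similar = [(0.1, 0.1), (0.1, 0.0)]
shifted = [(3.0, 3.0), (3.2, 2.8)]

# Unlabeled instances at a leaf would receive the leaf's majority label.
leaf_label = majority_label(["spam", "ham", "spam"])

similar_flag = drift_suspected(history, similar, threshold=1.0)
shifted_flag = drift_suspected(history, shifted, threshold=1.0)
```

In this toy setting only the shifted batch exceeds the threshold. A real stream classifier would additionally need a way to separate noise from genuine drift and to store earlier clusters so that recurring concepts can be recognized, which this sketch does not attempt.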
Keywords: Data Streams, Classification, Recurring Concept Drift, Random Decision Tree, Unlabeled Data