Adaptive classification of scarcely labeled and evolving data streams

Posted on:2010-10-27

Degree:Ph.D

Type:Dissertation

University:The University of Texas at Dallas

Candidate:Masud, Mohammad Mehedy

GTID:1448390002473247

Subject:Computer Science

Abstract/Summary:

In this dissertation we propose solutions to four major problems encountered in data stream classification: infinite length, concept-drift, concept-evolution, and limited labeled data. Traditional data stream classification techniques address only the infinite length and concept-drift problems. Data streams are continuous flows of data, such as network traffic, sensor data, and call center records. The goal of data stream classification is to build a model from past labeled data and use that model to predict the class labels of future instances. Data streams are inherently infinite in length. Concept-drift occurs when the underlying concept of the data changes over time, and concept-evolution occurs when new classes emerge. Data streams that flow at high speed also suffer from a scarcity of labeled data, since it is impossible to manually label every data point in the stream.

We propose three techniques to address these problems. First, we propose efficient solutions to the infinite length and concept-drift problems using an ensemble classification approach. It solves the infinite length problem by dividing the stream into equal-sized chunks such that each chunk can be stored and processed in main memory. It builds v classification models from r consecutive chunks using v-fold cross-validation-style partitioning. An ensemble of such models is used to classify unlabeled data. Concept-drift is addressed by periodically updating the ensemble with newer models. Second, we provide a novel class detection technique for data streams that addresses the concept-evolution problem in addition to the infinite length and concept-drift problems. To the best of our knowledge, this is the first work that addresses the concept-evolution problem in a data stream classification framework. Our proposed technique automatically detects the presence of a novel class in data streams by analyzing and quantifying the cohesion among the unlabeled test instances and their separation from the training data. Finally, the limited labeled data problem is addressed by building a stream classification model from scarcely labeled training data using a semi-supervised clustering and ensemble classification approach. Our techniques outperform state-of-the-art data stream classification techniques on a number of benchmark stream datasets.
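
The first technique in the abstract (equal-sized chunks, v models built from r consecutive chunks, and an ensemble refreshed with newer models) can be illustrated with a rough sketch. This is not the dissertation's implementation. The nearest-centroid base learner, the rule of keeping the lowest-error models, the majority vote, the synthetic stream, and every name and parameter value below are assumptions made only for illustration.

```python
import random
from collections import Counter, deque


def chunks(stream, chunk_size):
    """Yield consecutive equal-sized chunks; a short final remainder is dropped."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == chunk_size:
            yield buf
            buf = []


class NearestCentroid:
    """Toy base learner (an assumption), chosen to keep the sketch dependency-free."""

    def fit(self, data):
        sums, counts = {}, Counter()
        for x, y in data:
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[y] += 1
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))


def error(model, data):
    return sum(model.predict(x) != y for x, y in data) / len(data)


def train_v_models(recent_chunks, v, rng):
    """Pool r chunks, split into v folds, train each model on v-1 folds."""
    pooled = [inst for chunk in recent_chunks for inst in chunk]
    rng.shuffle(pooled)
    folds = [pooled[i::v] for i in range(v)]
    scored = []
    for i in range(v):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        model = NearestCentroid().fit(train)
        scored.append((error(model, folds[i]), model))  # error on the held-out fold
    return scored


def vote(ensemble, x):
    return Counter(m.predict(x) for _, m in ensemble).most_common(1)[0][0]


def classify_stream(stream, chunk_size=50, r=2, v=3, ensemble_size=3, seed=0):
    """Test-then-train loop; returns per-chunk accuracy of the ensemble."""
    rng = random.Random(seed)
    window = deque(maxlen=r)  # only the last r chunks are kept in memory
    ensemble, accuracies = [], []
    for chunk in chunks(stream, chunk_size):
        if ensemble:
            accuracies.append(sum(vote(ensemble, x) == y for x, y in chunk) / len(chunk))
        window.append(chunk)
        if len(window) == r:
            # Re-score current members on the newest chunk, which they have not seen,
            # then keep the lowest-error models (assumed replacement rule).
            rescored = [(error(m, chunk), m) for _, m in ensemble]
            candidates = rescored + train_v_models(window, v, rng)
            ensemble = sorted(candidates, key=lambda p: p[0])[:ensemble_size]
    return accuracies


def drifting_stream(n, drift_at, seed=0):
    """Synthetic two-class stream whose labels swap at drift_at (abrupt drift)."""
    rng = random.Random(seed)
    for t in range(n):
        label = rng.randint(0, 1)
        center = 4.0 * label
        x = (rng.gauss(center, 0.5), rng.gauss(center, 0.5))
        yield x, (1 - label if t >= drift_at else label)


accuracies = classify_stream(drifting_stream(1000, drift_at=500))
print([round(a, 2) for a in accuracies])
```

In this toy setup the printed accuracies should drop sharply at the chunk where the labels swap and recover once models trained on post-drift chunks displace the older ones. That is the behavior the abstract attributes to periodic ensemble updates.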

Keywords/Search Tags:

Data stream, Classification, Labeled, Infinite, Addresses the concept-evolution problem

Related items

1	Research On Method Of Novel Class Detection And Classification For Concept-Drifting Data Stream Mining
2	Research On Semi-supervised Data Stream Classification Method Based On Ensemble Model
3	Research On The Classification Of Data Stream With Concept Drift Based On Cosine Similarity
4	Research On Data Stream Classification Algorithm With Limited Amount Of Labeled Data
5	Research On Dynamic Data Stream Classification Algorithm
6	Research Of Evolving Data Stream Clustering
7	Research On Classification Algorithm For Conceptual Drift Data Flow
8	Research On Classification And Regression Algorithms On Concept Drifting Data Streams And Its Application
9	Research On Ensemble Classification Algorithms Of Data Stream Based On Concept Drift
10	Research On Concept Drift Detection In Data Stream And Classification Algorithms For Imbalanced Data Stream