Adaptive classification of scarcely labeled and evolving data streams

Posted on: 2010-10-27
Degree: Ph.D.
Type: Dissertation
University: The University of Texas at Dallas
Candidate: Masud, Mohammad Mehedy
Full Text: PDF
GTID: 1448390002473247
Subject: Computer Science
Abstract/Summary:
In this dissertation we propose solutions to four major problems encountered by data stream classification, namely, infinite length, concept-drift, concept-evolution and limited labeled data. Traditional data stream classification techniques address only the infinite length and concept-drift problems. Data streams are continuous flows of data, such as network traffic, sensor data and call center records. The goal of data stream classification is to build a model using past labeled data and use the model to predict the class labels of future instances. Data streams are inherently infinite in length. Concept-drift occurs in data streams when the underlying concept of the data changes over time, and concept-evolution occurs when new classes evolve. Data streams that flow at high speed also suffer from scarcity of labeled data since it is impossible to manually label all the data points in the stream.

We propose three different techniques to address these problems. First, we propose efficient solutions to the infinite length and concept-drift problems using an ensemble classification approach. It solves the infinite length problem by dividing the stream into equal-sized chunks such that each chunk can be stored and processed in main memory. It builds v classification models from r consecutive chunks using v-fold cross-validation-style partitioning. An ensemble of such models is used to classify unlabeled data. Concept-drift is addressed by periodically updating the ensemble with newer models. Second, we provide a novel class detection technique for data streams that addresses the concept-evolution problem in addition to the infinite length and concept-drift problems. To the best of our knowledge, this is the first work that addresses the concept-evolution problem in a data stream classification framework.
Our proposed technique automatically detects the presence of a novel class in data streams by analyzing and quantifying the cohesion among the unlabeled test instances and the separation of the test instances from the training data. Finally, the limited labeled data problem is addressed by building a stream classification model from scarcely labeled training data using semi-supervised clustering and an ensemble classification approach. Our techniques outperform state-of-the-art data stream classification techniques on a number of benchmark stream datasets.
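
The first technique described above (chunking the stream, training v models from r consecutive chunks with v-fold-style partitioning, and periodically refreshing the ensemble) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the nearest-centroid base learner, all function and class names, the rule of keeping the lowest-error models, and majority voting are hypothetical choices made for the example.

```python
import random
from collections import Counter


class CentroidModel:
    """Toy nearest-centroid classifier; stands in for any base learner (assumption)."""

    def __init__(self, rows):
        sums, counts = {}, Counter()
        for x, y in rows:
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[y] += 1
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))


def error_rate(model, rows):
    """Fraction of labeled rows (features, label) that the model misclassifies."""
    return sum(model.predict(x) != y for x, y in rows) / len(rows)


def train_fold_models(chunks, v, rng):
    """Pool r chunks, split into v folds, train one model per held-out fold.

    Returns (held_out_error, model) pairs, one per fold.
    """
    pooled = [row for chunk in chunks for row in chunk]
    rng.shuffle(pooled)
    folds = [pooled[i::v] for i in range(v)]
    scored = []
    for held_out in range(v):
        train = [row for i, fold in enumerate(folds) if i != held_out for row in fold]
        model = CentroidModel(train)
        scored.append((error_rate(model, folds[held_out]), model))
    return scored


def update_ensemble(ensemble, candidates, latest_chunk, size):
    """Keep the `size` lowest-error models (assumed replacement policy).

    Existing models are re-scored on the latest labeled chunk; new
    candidates keep the error measured on their held-out fold.
    """
    rescored = [(error_rate(m, latest_chunk), m) for m in ensemble]
    ranked = sorted(rescored + candidates, key=lambda pair: pair[0])
    return [model for _, model in ranked[:size]]


def ensemble_predict(ensemble, x):
    """Majority vote over the ensemble members (assumed combination rule)."""
    votes = Counter(model.predict(x) for model in ensemble)
    return votes.most_common(1)[0][0]
```

Because only the most recent r chunks are held in memory and older models are displaced as their error on new data grows, a loop that calls `train_fold_models` and `update_ensemble` once per labeled chunk handles an unbounded stream while tracking gradual changes in the underlying concept.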
Keywords/Search Tags:Data stream, Classification, Labeled, Infinite, Addresses the concept-evolution problem