Mining massive data streams

Posted on: 2006-11-21
Degree: Ph.D.
Type: Thesis
University: University of Washington
Candidate: Hulten, Geoffrey
Full Text: PDF
GTID: 2458390005992251
Subject: Computer Science
Abstract/Summary:
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. In this thesis we develop a method that can semi-automatically enhance a wide class of existing learning algorithms so that they can learn from such high-speed data streams in real time. In particular, our method can be applied to essentially any induction algorithm based on discrete search. After our method is applied, the algorithm: learns from data streams in an incremental, any-time fashion; runs in time independent of the amount of data seen, while making decisions that are essentially identical to those that would be made from infinite data; uses a constant amount of RAM no matter how much data it sees; and adjusts its learned models in a very fine-grained manner as the data-generating process changes over time. We evaluate our method by using it to produce a series of learning algorithms (for decision trees, Bayesian network structure, and clustering), all of which are capable of learning from high-speed data streams. We evaluate these learners with extensive studies on synthetic data sets, and by applying them to a collection of massive real-world mining tasks.
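The abstract does not say how a learner can commit to a search decision after seeing only part of a stream. One common way to get that behavior, assumed here purely for illustration and not taken from the abstract, is a Hoeffding bound: keep one running sum per candidate, and commit once the gap between the two best sample means exceeds twice the bound. The class and function names below are hypothetical, and per-example scores are assumed to lie in a known range.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Half-width eps such that, with probability at least 1 - delta, the
    true mean is within eps of the mean of n independent samples whose
    values span at most `value_range`."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class StreamingChoice:
    """Hypothetical helper: picks the best of several candidates from a
    stream of per-example scores, storing only one sum per candidate."""

    def __init__(self, num_candidates, value_range=1.0, delta=1e-6):
        if num_candidates < 2:
            raise ValueError("need at least two candidates to compare")
        self.sums = [0.0] * num_candidates  # constant memory, no examples kept
        self.n = 0
        self.value_range = value_range
        self.delta = delta

    def update(self, scores):
        """Fold one example's score for each candidate into the running sums."""
        for i, score in enumerate(scores):
            self.sums[i] += score
        self.n += 1

    def decided(self):
        """Return the index of the winning candidate, or None if the
        evidence seen so far does not yet separate the top two."""
        if self.n == 0:
            return None
        means = [s / self.n for s in self.sums]
        order = sorted(range(len(means)), key=means.__getitem__, reverse=True)
        best, second = order[0], order[1]
        eps = hoeffding_bound(self.value_range, self.delta, self.n)
        # Each mean can be off by eps, so the gap must exceed 2 * eps.
        if means[best] - means[second] > 2.0 * eps:
            return best
        return None


if __name__ == "__main__":
    choice = StreamingChoice(num_candidates=2)
    seen = 0
    while choice.decided() is None:
        choice.update([1.0, 0.0])  # candidate 0 always scores higher here
        seen += 1
    print("committed to candidate", choice.decided(), "after", seen, "examples")
```

In this sketch the number of examples needed depends on the gap between candidates and on `delta`, not on the length of the stream, which is one way to read the abstract's claim of running time independent of the amount of data seen. It does not attempt the adjustment to a changing data-generating process that the abstract also describes.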
Keywords/Search Tags: Data streams, Mining