Mining Concept-Drifting Data Streams With Semi-Random Multi-Decision Trees

Posted on:2009-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:P P Li

Full Text:PDF

GTID:2178360245471559

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the rapid development and wide applications of information technologies, streaming data have become ubiquitous, such as supermarket transactions, Internet search requests, telephone call records etc. Because streaming data are continuous, high-volume and open-ended, traditional mining algorithms cannot mine databases from these data environments in real-time, which leads to the loss of useful information. Classification is one of the most important branches of data mining and the research on the classification in data streams is very significant and valuable in real-word applications, such as the credit-card-fraud prediction and the network intrusion detection. However, most of traditional classification algorithms need scanning the databases multiple times and storing the entire data, which are not suitable for the streaming environments. Therefore, efficient mining of data streams has become a popular research topic in data mining.The main contributions of this thesis are in classification on data streams as follows.(1) Existing efforts on data stream research and concept drifting in data streams are reviewed, which include the definitions, applications, theoretical foundations and measurement standards. Related work on the algorithms and models for data streams and concept drifting in data streams is surveyed and analyzed.(2) When performing classification on data streams, traditional classification algorithms based on decision trees have a relatively poor efficiency in both time and space due to the characteristics of streaming data. Thus, an incremental algorithm for mining data streams, SRMTDS based on random decision trees is proposed in this thesis, which uses the inequality of Hoeffding bounds to choose the minimum number of split-examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naive Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS has an improved performance in time, space, accuracy and the anti-noise capability in comparison with VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams.(3) An incremental algorithm SRMTCD for mining data streams with concept drifting based on semi-random decision trees is presented for the characteristics of hidden concept changes in the context of data streams, which takes two sliding windows, uses the inequality of Hoeffding Bounds to determine the thresholds for distinguishing the true drift from noise, and chooses the classification function to estimate the error rate for periodic detection. Our extensive empirical study shows that SRMTCD has an improved performance in time, accuracy and anti-noise capabilities in comparison with CVFDT, a state-of the-art decision-tree algorithm for classifying concept-drifting data streams.

Keywords/Search Tags:

Data Streams, Concept drift, Semi-Random DecisionTrees, Naive Bayes

PDF Full Text Request

Related items

1	Classification Algorithm For Data Streams With Concept Drift And Its Applications
2	Research On Concept Drift Detection Algorithms For Data Streams
3	Window-based Classification Algorithms For Concept Drifting Data Streams
4	Based Spam Filtering System Designed To Adapt To Concept Drift
5	Study On Data Streams Online Classification Algorithm Of Adapting To The Concept-Drift
6	Concept Drifting Detection And Classification On Data Streams
7	Research On Classification For Data Streams With Concept Drift
8	Research On Classification Algorithms Of Concept Drift And Imbalanced Data Streams
9	Study Of Mixture Ensemble Classifications For Mining Data Streams With Concept Drift
10	Learning On Evolving Data Streams