Font Size: a A A

Mining Concept-Drifting Data Streams With Semi-Random Multi-Decision Trees

Posted on:2009-03-01Degree:MasterType:Thesis
Country:ChinaCandidate:P P LiFull Text:PDF
GTID:2178360245471559Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development and wide applications of information technologies, streaming data have become ubiquitous, such as supermarket transactions, Internet search requests, telephone call records etc. Because streaming data are continuous, high-volume and open-ended, traditional mining algorithms cannot mine databases from these data environments in real-time, which leads to the loss of useful information. Classification is one of the most important branches of data mining and the research on the classification in data streams is very significant and valuable in real-word applications, such as the credit-card-fraud prediction and the network intrusion detection. However, most of traditional classification algorithms need scanning the databases multiple times and storing the entire data, which are not suitable for the streaming environments. Therefore, efficient mining of data streams has become a popular research topic in data mining.The main contributions of this thesis are in classification on data streams as follows.(1) Existing efforts on data stream research and concept drifting in data streams are reviewed, which include the definitions, applications, theoretical foundations and measurement standards. Related work on the algorithms and models for data streams and concept drifting in data streams is surveyed and analyzed.(2) When performing classification on data streams, traditional classification algorithms based on decision trees have a relatively poor efficiency in both time and space due to the characteristics of streaming data. Thus, an incremental algorithm for mining data streams, SRMTDS based on random decision trees is proposed in this thesis, which uses the inequality of Hoeffding bounds to choose the minimum number of split-examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naive Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS has an improved performance in time, space, accuracy and the anti-noise capability in comparison with VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams.(3) An incremental algorithm SRMTCD for mining data streams with concept drifting based on semi-random decision trees is presented for the characteristics of hidden concept changes in the context of data streams, which takes two sliding windows, uses the inequality of Hoeffding Bounds to determine the thresholds for distinguishing the true drift from noise, and chooses the classification function to estimate the error rate for periodic detection. Our extensive empirical study shows that SRMTCD has an improved performance in time, accuracy and anti-noise capabilities in comparison with CVFDT, a state-of the-art decision-tree algorithm for classifying concept-drifting data streams.
Keywords/Search Tags:Data Streams, Concept drift, Semi-Random DecisionTrees, Naive Bayes
PDF Full Text Request
Related items