With the recent explosive growth of the Internet, a new kind of data has emerged: data streams (i.e., streaming data), such as web page click logs, credit card transaction records, and real-time network monitoring data. Streaming data also arises in medical and bioinformatics research. There are two main challenges in mining streaming data:

(1) Some users only pay attention to one class of samples. For example, when browsing news online, a user may focus on a particular topic, such as sports; such a user only wants documents about sports and ignores other topics. Similar scenarios arise in credit card fraud detection and plant growth monitoring. In these settings, users care about one particular class of data over all others.

(2) Labeling samples is time-consuming and costly, and the volume of data in a stream is typically huge, so it is unrealistic to label every sample. As a special kind of semi-supervised learning, PU learning requires only some positive examples (i.e., samples of the class we focus on, the target class) together with a large number of unlabeled examples, thereby saving a great deal of human effort, usually at the cost of some classification performance.

This thesis discusses how to modify a decision tree for classifying streaming data so that it can learn from samples in the stream incrementally and handle positive and unlabeled data. In summary, there are two main contributions:

(1) Constructing PUVFDT to handle positive and unlabeled data streams without concept drift. PUVFDT combines the classic data stream classification algorithm VFDT with the way POSC4.5 computes information gain. We simulate PosLevel, the ratio of positive samples in the original fully labeled data set, to build 9 decision trees, and a selection strategy is used to choose the best of the 9 trees (a hedged sketch of the PU information-gain estimate appears below). Experiments on both synthetic and real-life data show that the classification performance of PUVFDT is satisfactory: even when 80% of the samples in the stream are unlabeled, its performance remains very close to that of VFDT trained on the fully labeled stream. Achieving this with only 20% of the samples labeled saves substantial human effort and makes PUVFDT more applicable to real-life applications.

(2) Based on an analysis of the stability of PUVFDT, we propose an "over-sampling"-like strategy to ensemble PUVFDT and thereby improve its classification performance. Using the stability criterion currently in use, we conclude that PUVFDT is stable on our synthetic data sets and on the real-world data set. Then, following Oza et al.'s research, we propose the following strategy: first, every base classifier learns each sample arriving from the data stream once; then, using the Poisson distribution, we determine how many additional times each base classifier re-learns that sample (a sketch of this re-learning scheme appears below). Experiments on both synthetic and real-world data sets show that the strategy is effective: both F1 and classification accuracy improve, and t-tests confirm that these improvements are statistically significant.
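The thesis itself contains no code, so the following is only a minimal sketch of how a POSC4.5-style information-gain estimate from positive and unlabeled counts might be computed: the assumed positive prior (PosLevel) is used to estimate, at each node, how many of the unlabeled examples reaching that node are actually positive, and entropy is computed from that estimate. The function names, the count-based estimate, and the proportional-spreading assumption are illustrative reconstructions, not details taken from the thesis.

```python
import math

def entropy(p):
    """Binary entropy of a positive-class proportion p, clamped to [0, 1]."""
    p = min(max(p, 0.0), 1.0)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pu_positive_ratio(n_pos_node, n_unl_node, n_pos_total, n_unl_total, pos_level):
    """Estimate the fraction of truly positive examples among the unlabeled
    examples reaching a node, assuming labeled positives and unlabeled
    examples follow the same distribution over the nodes."""
    if n_unl_node == 0 or n_pos_total == 0:
        return 0.0
    # Estimated positives hidden in the unlabeled examples at this node:
    # total unlabeled positives (pos_level * n_unl_total), spread in
    # proportion to how the labeled positives distribute over the nodes.
    est_pos = pos_level * n_unl_total * (n_pos_node / n_pos_total)
    return min(est_pos / n_unl_node, 1.0)

def pu_information_gain(parent, children, n_pos_total, n_unl_total, pos_level):
    """parent and children are (n_pos_node, n_unl_node) count pairs for a node
    and the branches of a candidate split; returns the estimated gain."""
    h_parent = entropy(pu_positive_ratio(*parent, n_pos_total, n_unl_total, pos_level))
    n_unl_parent = parent[1]
    h_split = 0.0
    for n_pos_c, n_unl_c in children:
        w = n_unl_c / n_unl_parent if n_unl_parent else 0.0
        h_split += w * entropy(
            pu_positive_ratio(n_pos_c, n_unl_c, n_pos_total, n_unl_total, pos_level)
        )
    return h_parent - h_split
```

In a VFDT-style learner this estimated gain would replace the fully supervised information gain when deciding, via the Hoeffding bound, whether the best split attribute is reliably better than the second best.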
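The "over-sampling"-like ensemble update of contribution (2) can likewise be sketched under the assumption that it follows Oza et al.'s online bagging: each base classifier learns every incoming sample once and then re-learns it k additional times, where k is drawn from a Poisson distribution. The `learn_one`/`predict_one` interface, the ensemble size, and the rate lambda = 1 are assumptions made for illustration, not details from the thesis.

```python
import math
import random

class PUVFDTEnsemble:
    """Ensemble of incremental PU classifiers trained with an
    'over-sampling'-like online-bagging update (after Oza et al.)."""

    def __init__(self, base_classifiers, lam=1.0, seed=0):
        self.models = base_classifiers   # e.g. a list of PUVFDT-like learners
        self.lam = lam                   # Poisson rate (lambda = 1 assumed)
        self.rng = random.Random(seed)

    def learn_one(self, x, y):
        for model in self.models:
            # Step 1: every base classifier learns the incoming sample once.
            model.learn_one(x, y)
            # Step 2: re-learn the same sample k extra times, k ~ Poisson(lam).
            k = self._poisson(self.lam)
            for _ in range(k):
                model.learn_one(x, y)

    def predict_one(self, x):
        # Simple majority vote over the base classifiers' predictions.
        votes = [model.predict_one(x) for model in self.models]
        return max(set(votes), key=votes.count)

    def _poisson(self, lam):
        # Knuth's method for sampling from a Poisson distribution.
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= L:
                return k
            k += 1
```

The extra Poisson-distributed repetitions play the role of over-sampling: each base classifier effectively sees a slightly different, reweighted version of the stream, which is what gives the ensemble its diversity.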