Font Size: a A A

Study Of Classifying Uncertain Data Streams

Posted on:2015-04-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:C Q LiangFull Text:PDF
GTID:1228330434470185Subject:Agricultural Electrification and Automation
Abstract/Summary:PDF Full Text Request
In many real-life applications, such as wireless sensor network, credit card frauddetection, and network monitoring, huge amount of streaming data are continuously generatedat a high speed, along with uncertainty due to imprecise measurement, repeated sampling,missing value, privacy protection etc. Uncertain data streams are becoming increasingprevalent and important; managing and mining on uncertain data streams have become hotresearch topics. Classifying data stream is a very important branch of data stream mining;many real-life applications can be modeled as data stream classification problems. However,most of existing data stream classification algorithms need to take as input a sequence ofprecise and definite data item, which makes them unsuitable to real-life applications.This paper focused on data stream classification problems, mainly including how to learnuncertain very fast decision trees from uncertain data streams, how to summarize thedistribution of an uncertain data stream, how to detect and cope with evolving concept hiddenbehind uncertain data streams, and how to classify uncertain data streams with only positiveand unlabeled samples available. The contributions of this paper include:(1) To classify uncertain data streams, on the basis of Very Fast Decision Tree (VFDT)algorithm, this paper proposed uVFDTc, an uncertain VFDT tree with uncertain naive Bayesclassifiers at tree leaves. Methods were presented to track the distribution of streaminguncertain numerical values. Techniques were developed to collect sufficient statistics fromuncertain data streams. The naive Bayes model was extended to handle uncertainty online. Inlearning phase, the proposed uVFDTc algorithm applied Hoeffding bound theory on thecollected sufficient statistics and yielded fast and reasonable decision trees. In classificationphase, at tree leaves it used uncertain naive Bayes (UNB) classifiers to improve classificationperformance. Experimental results demonstrated the strong ability of uVFDTc to classifyuncertain data streams. The use of UNB at tree leaves could improve the performance ofuVFDTc, especially the any-time property, the benefit of exploiting uncertain information,and the robustness against uncertainty.(2) This paper studied the problem of computing approximate quantiles on uncertain datastreams. To capture the distribution of an uncertain data stream more precisely, based on GKalgorithm, this paper proposed uGK algorithm for summarizing approximate quantiles on streaming data, and applied it in uVFDTc algorithm for classification analysis. The proposeduGK algorithm used the same summarizing data structure and had the same space complexityas those of GK algorithm. In summarizing, only a single pass and very little space wereneeded, but the distribution captured by summarized data could approximate the real observeddistribution at the specified precise level. Experimental results showed that uGK usedsubstantially less space than that indicated by analysis. More importantly, the required spacedid not rise as the uncertain level increased, but the summarized data could always supportquantile query operation at the specified precise level. Applied in classification algorithm, theproposed uGK allowed uVFDTc to have a better classification accuracy.(3) Based on uVFDTc and CVFDT, this paper proposed uCVFDTc algorithm to learnvery fast decision tree with ability to cope with concept drift. In learning phase, uCVFDTcused the same techniques as those of uVFDTc to handle uncertain samples and collectsufficient statistics. It applied Hoeffding bound theory on collected statistics to yield a fastand reasonable decision tree. Besides, by using the techniques of sliding window and alternatetree, uCVFDTc had the ability to cope with concept drift. In classification stage, at tree leavesit also used UNB classifiers to improve classification performance. Experimental resultsshowed that uCVFDTc had strong ability to learn from uncertain data stream, detect and copewith concept drift; the use of UNB at tree leaves could improve the performance of uCVFDTc,especially the ability to handle concept drift and the robustness against uncertainty.(4) This paper studied the problem of positive and unlabeled learning on uncertain datastreams. To classify uncertain data stream with only positive and unlabeled samples, thispaper proposed puuCVFDT algorithm to learn very fast decision tree from uncertain datastream, using only positive and unlabeled samples. The technique for collecting sufficientstatistics was given. The uncertain information gain for measuring the purity of positive andunlabeled samples (puuIG), and the method for calculating puuIG on collected sufficientstatistics were proposed. Based on puuIG and Hoeffding bound theory, a series decision treeswere built. The method for selecting the best tree as the output of the proposed puuCVFDT,and the method for classifying unknown samples were given. Experimental resultsdemonstrated that the proposed puuCVFDT algorithm had strong capabilities to learn fromuncertain data streams with positive and unlabeled samples and tackle concept drift. Evenwhen uncertainty level extended to30%, and only10%of the samples in data streams werepositive, the classification performance of puuCVFDT was still very close to that of CVFDT,which was trained on fully labeled data without uncertainty.
Keywords/Search Tags:uncertain data, data stream classification, very fast decision tree, conceptdrift, positive and unlabeled learning
PDF Full Text Request
Related items