Font Size: a A A

Research On Classification Techniques In Data Streams Mining

Posted on:2008-04-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:T WangFull Text:PDF
GTID:1118360242499236Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of information technology,especially the emerging of the network technology,our abilities to collect,store and transfer data have been improved dramatically.Comparing to the explosive growth of data,the needs for decision-relevant knowledge are not satisfied yet.Knowledge discovery and data mining technology is an important approach to address this problem.Nowadays data streams processing is becoming the new hot field of database research.Due to the volume and rapidness of data in stream,conventional techniques of data mining won't be suitable any more,and it needs new techniques.Classification is a very important technique in the field of data mining,and it's one of the most successful business applications.The character of data streams model make classification in data streams more difficult.This paper focuses on the field of classification techniques in data streams,and proposes a series of algorithms to increase classification speed,to improve classification accuracy and to handle concept drift.We revisit the problem of continuous attributes handling in data streams and propose a decision tree classifier system named VFDTb that bases on binary search trees.The proposed system is on top of the most successful VFDT,and it achieves excellent performance.The most relevant property of the system is an average large reduction in processing time,while keeps the same tree size and accuracy.On top of VFDT and VFDTc,we propose a system VFDTt.It makes the following three contributions:1) we present a threaded binary search trees(TBST) approach for efficiently handling continuous attributes.It builds a threaded binary search tree,and its processing time for values inserting is O(nlogn),while VFDT's processing time is O(n2).When a new example arrives,VFDTc need update O(logn) attribute tree nodes, but VFDTt just need update one necessary node.2) we improve the method of getting the best split-test point of a given continuous attribute.Comparing to the method used in VFDTc,it improves from O(nlogn) to O(n) in processing time.3) Comparing to VFDTc, VFDTt's candidate split-test number decrease from O(n) to O(logn).Overall,the techniques introduced here significantly improve the efficiency of decision tree classification on data streams.Decision trees are one of the most popular methods for learning and reasoning from feature-based examples.They have undergone a number of alternations to deal with language and measurement uncertainties.Fuzzy decision trees(FDT) is one of such extensions,it aims at combining symbolic decision trees with approximate reasoning offered by fuzzy representation.The intent is to exploit complementary advantages of both:popularity in applications to learning from examples and high knowledge comprehensibility of decision trees,ability to deal with inexact and uncertain information of fuzzy representation.On top of VFDT and VFDTc,we propose a system IFVFDT.It is combined with VFDTt and improves the soft discretizaiton method,so it has high classification accuracy and speed.Overall,comparing to VFDT,IFVFDT's average reduction of processing time is 16.66%,and comparing to VFDTt,fVFDT's average reduction is 8.87%.IF both algorithms have 10%noise data,VFDT's error rate trends to 12.5%,while the IFVFDT's error rate trends to 8%.Experiment results show that IFVFDT get better accuracy by using soft discretization,and it handles the problem of noise efficiently.Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution.Unfortunately,most of the large databases available for mining violate this assumption.They were gathered over months or years, and the underlying processes generating them changed during this time,sometimes radically.Hulten's CVFDT is one of the most successful one for handling concept drift. It stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable,and replacing the old with the new when the new becomes more accurate.On top of CVFDT,we propose a system HashCVFDT based on extended hash table.Combing hash table and list,it can sort attribute values efficiently,while still inserts and searchs value quickly.It's example insertion,example deletion and best cut point selection are efficient.
Keywords/Search Tags:Data Streams, Classification, Continuous Atrribute, Concept Drift, Binary Search Tree, Threaded Binary Search Tree, Soft Discretization, Extended Hash Table
PDF Full Text Request
Related items