Research On Classification Technologies In Mining Unsteady Data Streams

Posted on: 2010-03-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Z Ou
GTID: 1118360305473645
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of information technology, more and more real-world applications, such as wireless sensor networks, network traffic monitoring, and credit card fraud detection, generate massive, high-speed, continuously arriving data known as data streams. Research on data stream processing is therefore of great significance, and mining data streams has become one of the frontiers of data mining research. In recent years a great deal of work has addressed concept drift, noise, and class imbalance, which complicate the task of learning a model from data and require special approaches.

This dissertation focuses on classification techniques for mining unsteady data streams, mainly on incremental learning methods, ensemble classifier approaches to concept drift, and ensemble classifier approaches to handling noisy and imbalanced data. The main work of this dissertation includes:

1. This dissertation surveys the state of the art in the growing and vital field of classifying concept-drifting data streams. It summarizes the approaches to detecting concept drift in streaming data, analyzes the related processing systems and algorithms in detail, and critically reviews the open problems and development trends of current data stream classification techniques.

2. This dissertation studies the application of traditional incremental learning classification techniques to mining data streams with concept drift. CVFDT is one of the most successful methods for handling concept drift. On this basis, a single classifier named SL_CVFDT is proposed. By exploiting the fast insertion and fast search of the skip list, SL_CVFDT achieves both fast example insertion, search, and deletion when handling concept drift and efficient selection of the best cut point. Experiments show that this single classifier has good scalability and stability in handling concept drift.

3. To handle concept drift and noise in realistic data streams, two ensemble classifiers named WEAP-I and WEAP-II are proposed, based on the averaging probability (AP) ensemble classifier under the learnable assumption. WEAP-I, which integrates a weighted ensemble classifier with the AP ensemble classifier, addresses the noise problem by buffering some historical data. Experiments show that WEAP-I has very good noise tolerance. WEAP-II, which builds on the AP ensemble classifier and the weighted ensemble classifier with a chunk-splitting technique, effectively handles both gradual drift inside a data chunk and abrupt drift between data chunks when mining noisy data streams. Theoretical and experimental studies show that, compared with the AP ensemble classifier, WEAP-II adapts better to data streams in which concept drift and noise coexist, with better classification performance, better noise tolerance, and similar or even lower time complexity.

4. To mine imbalanced data streams with concept drift under the stationary assumption, a novel ensemble classifier framework named IMDWE is proposed. It is based on the accuracy-weighted ensemble (AWE) classifier and uses undersampling and oversampling techniques. During ensemble learning, IMDWE uses different strategies to determine the weights according to the classification target. Theoretical and experimental studies show that IMDWE has lower time complexity than AWE: in the experiments it reduces execution time by 37.3% on average, making it better suited to mining imbalanced data streams with concept drift. IMDWE also has better overall classification performance than AWE, improving the G-mean metric by 7.22% on average, and it significantly improves classification accuracy on the minority class, improving the recall metric by 15.63% on average.

5. For mining noisy data streams with imbalanced class distributions, a novel ensemble classifier framework named IMDAP is proposed. It extends the averaging probability (AP) ensemble framework with undersampling and oversampling techniques under the learnable assumption. Experimental results show that IMDAP effectively handles imbalanced data streams in which concept drift and noise coexist. Compared with the AP ensemble classifier, IMDAP has similar time complexity, better noise tolerance, and better overall classification performance (G-mean metric), with an average improvement of 2.3%. In addition, IMDAP significantly increases classification accuracy on the minority class, by 7.1% on average.
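
The G-mean and minority-class recall figures reported in the abstract can be made concrete with a small sketch. The abstract does not state how the dissertation defines G-mean, so the code below assumes the commonly used definition: the geometric mean of recall on the minority class and recall on the majority class (specificity). The function name and the toy labels are hypothetical and serve only as illustration.

```python
import math


def recall_and_g_mean(y_true, y_pred, minority=1):
    """Return (minority-class recall, G-mean) for a binary problem.

    Assumption: G-mean = sqrt(minority recall * majority recall).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == minority:
            if pred == minority:
                tp += 1  # minority example correctly identified
            else:
                fn += 1  # minority example missed
        else:
            if pred == minority:
                fp += 1  # majority example flagged as minority
            else:
                tn += 1  # majority example correctly identified
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, math.sqrt(recall * specificity)


if __name__ == "__main__":
    # Toy chunk: 2 minority examples out of 10 (hypothetical data).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    always_majority = [0] * 10
    balanced_guess = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    print(recall_and_g_mean(y_true, always_majority))
    print(recall_and_g_mean(y_true, balanced_guess))
```

A classifier that always predicts the majority class is 80% accurate on this toy chunk, yet its minority recall and G-mean are both zero. This is why imbalanced-stream methods are usually compared on recall and G-mean rather than on plain accuracy.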
Keywords/Search Tags: Unsteady, Data Streams, Classification, Concept drift, Noise, Single-classifier, Skip List, Ensemble Classifier, Imbalance Data Streams, Sampling