Research On Dynamic Data Stream Classification Algorithm

Posted on: 2014-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 2248330398950259
Subject: Computer application technology
Abstract/Summary:
In recent years, with the development of information technology, traditional data mining has faced unprecedented challenges: the mining target is shifting from static data stored in databases to real-time dynamic data streams. Data streams are massive (the data cannot all be stored), must be processed in real time, and are unstable (they exhibit concept drift). Current research hot spots in data stream mining include credit card fraud detection, network security monitoring, sensor data monitoring, and power grids.

In a dynamic data stream environment, traditional classification methods struggle to meet the requirements of high speed and high performance. At the same time, because of concept drift, the knowledge implied in the data may change over time, so the classification model must be updated dynamically as the data change. Faced with concept drift, traditional classification methods often fail and are not suitable for dynamic data stream classification. A new classification method is therefore needed.

To address concept drift, and inspired by drift detection methods based on the KL divergence, this paper discusses a method that uses the KL divergence to measure concept similarity, with a KDQ tree to partition the data set and the bootstrap to determine the similarity threshold.

To handle the dynamics of data streams, this paper proposes a new semi-supervised data stream classification model built on the concept-similarity method. The model divides the data stream into sub-datasets; when new data arrive, it uses the concept-similarity method to select the appropriate classifier for classification. Artificial and real datasets are used to evaluate the performance of the model.
The experiments show that the proposed model can handle both abrupt and gradual concept drift and has good self-adaptive ability.

To handle the massive volume of data streams, this paper proposes a highly parallel algorithm for dynamic data stream classification based on the MapReduce framework. The algorithm builds on an incremental learning method for the extreme support vector machine and tracks concept drift in the data stream in real time: it constructs a weight matrix to correct the model residuals and uses a forgetting factor to strengthen the influence of new samples. Experiments show that the method achieves good parallel performance while handling concept drift in dynamic data streams efficiently.
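
The concept-similarity test described in the abstract (KL divergence between two chunks of the stream, with a bootstrap-derived threshold) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes one-dimensional numeric data, swaps the KDQ tree partition for plain fixed-width histogram bins, and uses additive smoothing of 0.5 and a 95% bootstrap quantile as arbitrary illustrative choices. All function names here are hypothetical.

```python
import math
import random


def bin_counts(values, lo, hi, bins):
    """Histogram counts over fixed-width bins (a stand-in for a KDQ tree partition)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp edge values into range
    return counts


def kl_divergence(p_counts, q_counts, smoothing=0.5):
    """KL(P || Q) from bin counts; smoothing (assumed value) avoids log(0)."""
    bins = len(p_counts)
    p_total = sum(p_counts) + smoothing * bins
    q_total = sum(q_counts) + smoothing * bins
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total
        q = (qc + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def window_distance(a, b, lo, hi, bins):
    """KL divergence between the binned distributions of two data chunks."""
    return kl_divergence(bin_counts(a, lo, hi, bins), bin_counts(b, lo, hi, bins))


def bootstrap_threshold(a, b, lo, hi, bins, rounds=200, quantile=0.95, seed=0):
    """Distance quantile under the hypothesis that both chunks share one concept."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    distances = []
    for _ in range(rounds):
        # Resample both chunks from the pooled data, with replacement.
        ra = [rng.choice(pooled) for _ in range(len(a))]
        rb = [rng.choice(pooled) for _ in range(len(b))]
        distances.append(window_distance(ra, rb, lo, hi, bins))
    distances.sort()
    return distances[min(int(quantile * rounds), rounds - 1)]


def same_concept(a, b, bins=10):
    """True if the observed distance does not exceed the bootstrap threshold."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    if hi == lo:
        return True  # all values identical, nothing to distinguish
    observed = window_distance(a, b, lo, hi, bins)
    return observed <= bootstrap_threshold(a, b, lo, hi, bins)


if __name__ == "__main__":
    rng = random.Random(1)
    reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
    drifted = [rng.gauss(3.0, 1.0) for _ in range(500)]
    print("same concept after a mean shift:", same_concept(reference, drifted))
```

In a model like the one described, a check of this kind could decide whether an incoming chunk is close enough to a stored sub-dataset to reuse that sub-dataset's classifier; how the thesis actually makes that selection is not stated in the abstract.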
Keywords/Search Tags: Data stream classification, Concept drift, Concept similarity, Time-forgetting robust extreme support vector machine