
Research And Application Of Business Data Stream Classification Mining Based On Incremental Storage

Posted on: 2012-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: X J Yin
Full Text: PDF
GTID: 2178330332483103
Subject: Computer application technology
Abstract/Summary:
Dynamic data stream mining has become a hot topic in data mining. In communications, phone records are mined to identify high-quality customers; Web click streams and network monitoring data are mined to detect underlying server attacks; and retail transaction streams are mined to recommend related services. All of these are cases of dynamic business data stream mining. Unlike traditional static data mining, mining dynamic data streams must cope with data that is massive, continuous, subject to abrupt change, and confidential, that must be processed and updated quickly, and that can be read only once. Because business data changes abruptly, the concept underlying it changes over time; the conceptual model must then be updated, a phenomenon known as concept drift. Unbounded data streams, concept drift, and related characteristics make classification models for data streams different from traditional classification models.
Such a model must process incoming data quickly and be adjusted in time to reflect new classification information. Building on existing research, this thesis first studies the storage of dynamic commercial data streams and proposes a dynamic incremental storage tree structure. Second, it studies concept drift in data streams and proposes an integrated Bayesian classification technique together with a real-time storage update strategy called "power of two". Finally, based on this work, it proposes a classification algorithm for data streams with concept drift based on the incremental storage tree (CMCD-ST).

The research and its contributions are as follows:

First, I survey data mining, its business application background, and its existing models, and summarize the latest research in the field to identify strengths that can be applied to commercial data mining.

Second, for the storage of data streams, and based on the characteristics of Bayesian algorithms and of data streams, the thesis presents the dynamic incremental storage tree structure, which replaces record-by-record storage with attribute trees. The size of a tree is determined by the number of attributes, attribute values, and classes. As a result, the storage required for a data stream depends not on the number of records but on the number of attributes, attribute values, and classes. This addresses the data storage problem, which is the biggest problem in dynamic data stream mining.

Third, research on multiple linear correlation among the attributes of business data streams yields a self-sampling technique, which is used to optimize and prune the attributes of the data to be classified. This solves the problem of multiple linear correlation.

Fourth, I study concept drift in commercial data mining, build a number of dynamic incremental storage trees, and design the real-time storage update strategy called "power of two". Combining these with the integrated Bayesian technique, I propose a classification algorithm for data streams with concept drift based on the incremental storage tree (CMCD-ST).

Finally, based on the above studies, I implement the CMCD-ST algorithm as a plug-in and successfully apply it to mining commercial data streams with concept drift. Experiments show that the algorithm handles commercial data streams with concept drift well and achieves high classification accuracy.
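
The storage idea in the second contribution can be illustrated with a minimal sketch. It is not the thesis's actual data structure: the class name `CountTree`, its methods, the sample attributes, and the use of Laplace smoothing are all assumptions made for illustration. It only shows why storing counts per attribute, attribute value, and class (rather than storing records) keeps storage bounded while still supporting a naive Bayes classifier.

```python
import math
from collections import defaultdict


class CountTree:
    """Hypothetical count store: attribute -> value -> class -> count."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, record, label):
        """Fold one stream record into the counts; the record is not kept."""
        self.total += 1
        self.class_counts[label] += 1
        for attr, value in record.items():
            self.counts[attr][value][label] += 1

    def leaf_count(self):
        """Number of stored (attribute, value, class) counters."""
        return sum(len(by_class)
                   for by_value in self.counts.values()
                   for by_class in by_value.values())

    def predict(self, record):
        """Naive Bayes prediction from counts (Laplace smoothing assumed)."""
        best_label, best_score = None, -math.inf
        for label, class_count in self.class_counts.items():
            score = math.log(class_count / self.total)
            for attr, value in record.items():
                by_value = self.counts.get(attr, {})
                hits = by_value.get(value, {}).get(label, 0)
                n_values = max(len(by_value), 1)
                score += math.log((hits + 1) / (class_count + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up sample stream: storage does not grow when records repeat.
tree = CountTree()
for _ in range(1000):
    tree.update({"plan": "prepaid", "usage": "high"}, "keep")
    tree.update({"plan": "contract", "usage": "low"}, "churn")
print(tree.total, tree.leaf_count())
print(tree.predict({"plan": "prepaid", "usage": "high"}))
```

In this sketch the number of counters depends only on the distinct attribute, value, and class combinations seen, so 2000 records occupy the same space as 2. Handling concept drift (multiple trees and the "power of two" update strategy) is outside what this sketch covers.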
Keywords/Search Tags: data streams, sliding window, self-sampling, Bayesian classification, concept drift, incremental storage tree