Font Size: a A A

Algorithms For Data Stream Mining

Posted on:2008-09-21Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z W YinFull Text:PDF
GTID:1118360242976034Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the rapid development and widespread use of computer network, database and multimedia technology, data stream model has been widely applied in a growing number of information processing applications, such as sensor network, financial analysis, data mining, manufacturing, chronometer et al. Compared with traditional data models, data stream model owns several distinguished characteristics: the volume of a stream is unbounded; a rapid response should be produced in data stream systems; the data distribution of stream may be changed continually. Traditional data mining techniques can hardly be applied to process data streams directly. This make researchers try hard to work out novel querying and processing techniques over data streams.In this dissertation, some principal techniques for data stream mining have been studied: frequent count mining, data stream classification, load shedding and feature selection. The contributions of this dissertation include:1. The HCOUNT+ algorithm is proposed to mine frequent items over data streams. The HCOUNT+ algorithm adopts aided measures to greatly improve the precision of the HCOUNT algorithm and it can estimate the frequency of each element with error no more thanεwith (1 +α)·(e/ε)·ln(-M/(lnp))(α<1) counters. In addition, HCOUNT+ is introduced to time critical applications and a novel sliding windows-based algorithm SL-HCOUNT+ is proposed to mine the most frequent items occurring recently instead of only those TOP ? Kitems. Theory analysis and experiments demonstrate that HCOUNT+ algorithm and SL-HCOUNT+ algorithm have a low time complexity and space complexity. Both algorithms are verified to have high query precision in experimental results.2. The Evolutionary Logistical Regression Classifier algorithm (ELRClass) is proposed to solve the classification problems of evolving data streams. Logistic Regression is a fast classifier and can achieve higher accuracy on small training data. Moreover, it can work on both discrete and continuous attributes with nonlinear patterns. The ELRClass algorithm applies Logistic Regression continually to a sliding window of samples in order to update the existing classifier, keep this classifier if its performance is deteriorated by the bursting noise or construct a new classifier if a major concept drift is detected. Intensive experimental results demonstrate the effectiveness of this algorithm.3. Several algorithms based on support vector machine are proposed for multi-class classification of data streams. This paper firstly propose one incremental version for each of the one-against-one, one-against-rest and DB2 algorithms, called One2One, One2All and LDB2 respectively, which are enabled the ability of multi-label classification for large datasets. They would preserve only the support vectors at each incremental step, add them to the training data for the next step, and construct several binary classifiers after the processing of the last batch data. In the testing phase, the average number of classifiers used by LDB2 is log2N , which is less than N used by One2All and much less than N ( N - 1)/ 2 used by One2One. Hence, LDB2 is faster than One2All and One2One in terms of testing time. Experimental results show that LDB2 has higher cross-validation accuracy than One2One and One2Rest. Then an adaptive method called IncreDB2, based on the DB2 method, is proposed to detect and adapt to local concept drift continuously in data stream classification. This method dynamically maintains a hierarchical classification tree. When local concept drift is detected, IncreDB2 only updates the nodes that affected by this drift rather than rebuilds a new classification tree from scratch, which means that it has better time efficiency. Experimental results demonstrate the validity and efficiency of this method.4. The Rate-sensitive Load Shedding algorithm (RLS) is proposed to determine the optimum drop location. Traditional load shedding algorithms for data stream systems calculate current operator selectivity over several run periods and use them to determine where to shed load during the next run period. This paper firstly point out that the current selectivity may change due to the implementation of load shedding. Then, the RLS algorithm is introduced to determine the optimum drop location by these changed selectivity rather than those pre-calculated values. Simulation results demonstrate that RLS achieves higher accuracy than traditional algorithms.5. Two feature selection methods called Projected OCFS (POCFS) and Projected OCFS+ (POCFS+) are proposed for data stream classification. The Orthogonal Centroid Feature Selection (OCFS) method can ensure optimal solutions according to the Orthogonal Centroid criterion (OC). Both POCFS and POCFS+ extend OCFS to select different features for each class pair individually instead of selecting the same features for all the classes simultaneously. They can select more suitable features for classifier construction than OCFS. In addition, POCFS+ introduces decrement ratios of feature scores to accelerate the speed of determining the number of features to be selected. As a result, POCFS+ selects features faster than OCFS and POCFS. Experimental results indicate that POCFS and POCFS+ significantly improve the classification efficiency and POCFS+ outperforms OCFS in terms of effectiveness and efficiency.
Keywords/Search Tags:Frequent count mining, classification, support vector machine, load shedding, feature selection, data stream mining
PDF Full Text Request
Related items