Font Size: a A A

Research On Clustering And Classification Algorithm Of Streaming Data

Posted on:2009-12-04Degree:MasterType:Thesis
Country:ChinaCandidate:L J ZouFull Text:PDF
GTID:2178360242493651Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Recently, with the development of information processing, the data is modeled as continuous transient data stream rather than as persistent relations in many application areas including financial applications, network traffic monitoring, sensor networks. This kind of data is known as data stream. Unlike the traditional data bases, data stream has the following distinguished characteristics:(1) unbounded volume of data; (2) rapid arriving rate of data; (3) uncontrollability of tuples'arriving order; (4) being processed only once for each tuple, except being reserved.Those characteristics propose the following requirements on data streams mining: Firstly, algorithms should be in an online process, and be able to fast handle each tuple, and output the mining result on time; Secondly, because of the limited memory compared with unlimited data streams, algorithms should have low space complexity, the space complexity should be in the range of logarithmic bound of data volume. Thirdly, by the constraint of online mining and low space complexity, algorithms can only get approximate results, however, that a certain precision of the results should be guaranteed. At last, algorithms should be adaptive, including the applicability on data streams with evolving underlying model, the abilities on handling outliers.Researches have made efforts on desighning efficient algorithms for mining data streams, however there are many problems need to be researched and resolved. In this paper, we study the problem of clustering data streams and classifying data stream. The main contributions of this paper include the following aspects:(1) We propose a clustering method for multiple data streams based on correlation coefficient. We use the correlation coefficient to measure the similarity between data streams. The attenuation coefficient was introduced to improve the performance of clustering. We use sliding window technique to reflect the behavior of clustering structure. A dynamic k-means algorithm is proposed to adjust the number of cluster and improve the clustering quality. In addition, based on correlation coefficient, we propose a COR framework to satisfy user's clustering demands. The framework consists of two phases, i.e., the foreground phase and the background phase. The foreground phase provides a novel mechanism to maintain the statistical information of the data stream. On the background phase, the approximation of desired substreams is retrieved according to clustering queries.(2) A new algorithm for clustering multiple data streams using a novel similarity metric is proposed. Based on spectral component similarity analysis, the algorithm can effectively cluster streams which show similar behavior with some unknown time delay. Auto-regressive modeling technique is exploited to measure lag correlation between data streams, which is used as the distance metrics for clustering. Based on a sliding window model, the algorithm can continuously report the most recent clustering results, and dynamically adjust the number of clusters.(3) An algorithm named GDCS for clustering stream data using a density-based approach is proposed. The algorithm uses an online component which maps each input data record into a grid, and an offline component which computes the density of the grid and clusters the grids based on their density. Furthermore, a theoretically sound technique is developed to detect and remove sporadic grids which only consist of outliers in order to dramatically improve the space and time efficiency of the system. Experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.(4) A modified Fisher discriminate analysis method is proposed to satisfy the real-time demand in data stream classifying. This method not only has a high efficiency but also overcomes the singular problem of within-class scatter matrix. The algorithm trains the model within sliding widow and rebuilds the model at a constant rate. The model learned can capture the up-to-data trends in the stream. Experiment results show that our algorithm can speed up the mining efficiently.
Keywords/Search Tags:Data stream, Clustering, correlation coefficient, AR model, spectral component, Classification, Fisher Discriminate Analysis
PDF Full Text Request
Related items