Research On Clustering Algorithm Of Streaming Data

Posted on: 2011-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: J C Li
GTID: 2178360305472973
Subject: Computer software and theory
Abstract/Summary:
Traditional data mining targets static data, drawn mainly from relational databases, data warehouses and transaction databases. However, with the rapid development and wide application of information technologies, a new kind of dynamic data set is growing rapidly, characterized by high speed, continuity and fast change. What we lack, therefore, is not data but techniques for handling such large data streams. Given these characteristics, how to mine data streams quickly and accurately with limited memory and processing speed has become an important research topic in cluster analysis of streaming data.
This dissertation studies the PMC algorithm, which is divided into online processing and offline clustering, following the two-part online/offline design of the CluStream algorithm. CluStream takes a single data object as its processing unit, which hurts clustering efficiency, and it does not work well for clusters of arbitrary shape. In contrast, the online part of PMC analyzes the data stream with two groups of processing units that intercept and analyze batches of data alternately. This solves the problem that breakpoints between batches of the data stream affect clustering accuracy, and batch processing is faster than processing with a single processing unit.
Moreover, the minimum spanning tree algorithm is used in the online process: data sets with skewed distributions can be clustered by cutting the most inconsistent edges, which also deals with the fixed number of clusters in the batch processing of the STREAM algorithm. As a result, higher-quality summary information about the data stream, together with specific information about some data objects, can be obtained. Using the pyramidal time frame model, this online information is stored as snapshots in a timely manner and then clustered by the offline clustering algorithm. In the offline process, clusters are taken as the representative objects, and the minimum spanning tree algorithm is again used for clustering. PMC thus overcomes CluStream's shortcoming of performing poorly on clusters of arbitrary shape, and cluster quality is greatly improved.
The dissertation reports a large number of experiments on both real and artificial data sets. They show that PMC not only handles clusters of arbitrary shape effectively but also achieves better clustering efficiency and quality. In addition, it is not sensitive to the order of the input data and performs well on skewed class distributions.
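The idea of clustering by cutting inconsistent edges of a minimum spanning tree can be illustrated with a minimal sketch. This is not the PMC algorithm itself, whose details the abstract does not give. It assumes Euclidean distance, a small in-memory batch of points, and a simple inconsistency rule (an edge is cut when it is longer than `factor` times the mean tree edge length). The names `mst_edges`, `mst_clusters` and `factor` are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the tree as a list of (weight, u, v) tuples.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, factor=2.0):
    """Cut tree edges longer than factor * mean edge length (an assumed rule),
    then label the remaining connected components as clusters."""
    edges = mst_edges(points)
    if not edges:
        return [0] * len(points)
    mean_w = sum(w for w, _, _ in edges) / len(edges)
    kept = [(u, v) for w, u, v in edges if w <= factor * mean_w]

    # Union-find over the edges that survived the cut.
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for u, v in kept:
        root[find(u)] = find(v)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


if __name__ == "__main__":
    batch = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    print(mst_clusters(batch))
```

Because the number of clusters follows from how many edges are cut, this style of clustering does not need the cluster count to be fixed in advance, and it can follow elongated or irregular shapes that centroid-based methods tend to split.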
Keywords/Search Tags: cluster analysis, data stream, minimum spanning tree, multi-processing unit