
The Two-tier Framework For Clustering Data Stream: Design And Implementation

Posted on: 2005-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: B Wang
GTID: 2168360125450922
Subject: Computer application technology
Abstract/Summary:
Data mining, also referred to as Knowledge Discovery in Databases, extracts potential, previously unknown, and useful information or patterns from large databases or data warehouses. It integrates theories and techniques from databases, artificial intelligence, machine learning, and statistics, and it is a promising research field within databases with high application value. Cluster analysis is one of the most important techniques in data mining. Clustering groups objects into clusters according to some of their attributes. The similarity within a cluster should be as large as possible and the similarity between clusters as small as possible; in other words, like groups with like. Clustering is itself a discovery process, and its results can reveal natural characteristics of the data distribution. Clustering also lays the foundation for applying other data mining techniques.

In today's economic activity, many large organizations generate millions of records every day, and in scientific research billions of bytes are routinely collected every day. For data of this magnitude, applying data mining techniques, especially cluster analysis, to extract interesting knowledge and patterns is highly significant. However, current algorithms cannot finish the mining task within a useful time, because the traditional data mining process assumes that the data are loaded into a stable, rarely updated database. Facing a large, unbounded, and fast data stream, the data mining system should run at the arrival speed of the data. We therefore need to shift our thinking from mining static data in a database to mining data streams.

Among the one-level clustering algorithms for data streams, one of the best known is STREAM, designed by Sudipto Guha et al., members of the Stream Group at Stanford University.
After about four years of refinement, STREAM has not only shown its applicability in practice but also has proven time and space complexity bounds in theory. Judged only by how well it satisfies the characteristics of data streams, the one-level clustering algorithm is clearly strong. Clustering, however, is a highly applied problem, and our attention must turn to practice. Besides meeting the quality requirements of clustering, we also need to let users obtain clustering results from different application perspectives. These are the aspects that the one-level algorithm ignores. It is therefore natural to break out of the current one-level framework for clustering data streams and propose a more effective one: the two-tier framework for clustering data streams.

The main contribution of this thesis is the design and implementation of the two-tier framework for clustering data streams, which consists of two parts, the Quick Compute Level and the Complex Analysis Level. We introduce two concepts, the micro-cluster and the pyramidal time framework, to store the statistical information about the data points more effectively. The statistical information is retained in the form of micro-clusters and stored according to the pyramidal time framework. Together, the micro-cluster and the pyramidal time framework provide a sound basis for the flow of the data stream through the two-tier framework. The Quick Compute Level is the online process that collects and pre-processes the data stream. It requires no input from users, such as the number of clusters or the granularity of clustering. Its objective is to maintain the statistical information at an appropriate temporal and spatial granularity so that it is useful to the next level, the Complex Analysis Level.
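The abstract does not spell out what a micro-cluster holds or how the pyramidal time framework schedules storage. The following is a minimal Python sketch under stated assumptions: a micro-cluster is taken to be the additive summary commonly used in stream clustering (point count, per-dimension linear sum, per-dimension squared sum), and a snapshot taken at clock tick t is filed under the largest order i such that alpha**i divides t, with only the most recent few snapshots kept per order. The names `MicroCluster`, `PyramidalStore`, `alpha`, and `per_order` are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class MicroCluster:
    """Additive summary of absorbed points (assumed layout, not the thesis's)."""
    n: int = 0
    linear_sum: list = field(default_factory=list)  # per-dimension sum of x
    square_sum: list = field(default_factory=list)  # per-dimension sum of x*x

    def absorb(self, point):
        # Fold one point into the running sums; the raw point is not stored.
        if self.n == 0:
            self.linear_sum = [0.0] * len(point)
            self.square_sum = [0.0] * len(point)
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # RMS deviation from the centroid, computed from the sums alone.
        var = sum(sq / self.n - (ls / self.n) ** 2
                  for ls, sq in zip(self.linear_sum, self.square_sum))
        return sqrt(max(var, 0.0))


class PyramidalStore:
    """Keeps recent snapshots densely and older ones at coarser spacing."""

    def __init__(self, alpha=2, per_order=3):
        self.alpha = alpha          # hypothetical base of the time pyramid
        self.per_order = per_order  # hypothetical budget per order (>= 1)
        self.orders = {}            # order -> list of (tick, snapshot)

    def order_of(self, tick):
        # Largest i such that alpha**i divides tick (tick must be >= 1).
        i = 0
        while tick % (self.alpha ** (i + 1)) == 0:
            i += 1
        return i

    def store(self, tick, snapshot):
        bucket = self.orders.setdefault(self.order_of(tick), [])
        bucket.append((tick, snapshot))
        del bucket[:-self.per_order]  # drop the oldest beyond the budget

    def ticks(self):
        return sorted(t for b in self.orders.values() for t, _ in b)
```

With `alpha=2` and `per_order=2`, storing a snapshot at every tick from 1 to 16 leaves ticks 4, 8, 10, 12, 13, 14, 15, and 16 in the store: recent history is kept at fine granularity and older history at progressively coarser granularity, which is the property the pyramidal arrangement is meant to provide.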
The algorithm maintains q micro-clusters at all times. When a micro-cluster is first created, we create a unique...
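How the budget of q micro-clusters is enforced is not visible in the truncated abstract. As a rough illustration only, the Python sketch below assumes one plausible policy: an arriving point is absorbed by the nearest micro-cluster if it lies within a distance threshold, otherwise it starts a new micro-cluster, and if that pushes the count above q, the two micro-clusters with the closest centroids are merged. The function names, the `max_dist` threshold, and the merge policy are assumptions, not the thesis's method. Each micro-cluster here tracks only a count and a linear sum; any further per-cluster attributes are not modeled.

```python
from math import dist


def centroid(mc):
    n, linear_sum = mc
    return [s / n for s in linear_sum]


def absorb(mc, point):
    n, linear_sum = mc
    return (n + 1, [s + x for s, x in zip(linear_sum, point)])


def merge(a, b):
    # Counts and linear sums are additive, so merging is element-wise addition.
    return (a[0] + b[0], [s + t for s, t in zip(a[1], b[1])])


def update(clusters, point, q, max_dist):
    """Process one arriving point while keeping at most q micro-clusters."""
    if clusters:
        i = min(range(len(clusters)),
                key=lambda k: dist(centroid(clusters[k]), point))
        if dist(centroid(clusters[i]), point) <= max_dist:
            clusters[i] = absorb(clusters[i], point)
            return clusters
    clusters.append((1, list(point)))  # start a new micro-cluster
    if len(clusters) > q:
        # Over budget: merge the two micro-clusters with the closest centroids.
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: dist(centroid(clusters[p[0]]),
                                             centroid(clusters[p[1]])))
        merged = merge(clusters[a], clusters[b])
        del clusters[b]  # b > a, so index a is still valid afterwards
        clusters[a] = merged
    return clusters
```

Because the update touches only the summaries and never revisits earlier points, it needs no user input beyond the fixed budget and threshold, which matches the role the abstract gives to the Quick Compute Level.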
Keywords/Search Tags: Implementation