
Stream-based Clustering Algorithm

Posted on: 2010-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2208360278469504
Subject: Computer application technology
Abstract/Summary:
In recent years, computer application technology has developed rapidly, and the ability to access and collect data has improved accordingly. Data streams have attracted growing attention as an important data source, and clustering algorithms for data streams have become an important research topic.

Unlike data in traditional databases, a data stream is unbounded in size, arrives at a high rate, and delivers its tuples in an order that the system cannot control. Because of these characteristics, a high-quality clustering algorithm is essential for obtaining accurate results.

This thesis presents an improved dual-tier data stream clustering algorithm named HSCS, which is divided into a fast-calculation layer and an accurate-calculation layer. The fast-calculation layer collects and pre-processes the data stream and is the basis of the dual-tier algorithm. It uses sliding windows of equal time span: a hash function samples the data in each window, the samples are processed into summary information about the data stream, and that summary is passed to the accurate-calculation layer. The accurate-calculation layer is the offline analysis part of the algorithm, so it has more freedom to apply different methods to obtain accurate clustering results. It takes the data sampled by the fast-calculation layer as its data source and, to obtain a better final result, clusters them with DBSCAN, a density-based clustering algorithm.

Experimental results on real data sets show that the algorithm reflects the overall distribution of the data stream by analyzing the sampled data while also reducing storage requirements, and that it is feasible and effective.
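The two layers described in the abstract can be illustrated with a minimal Python sketch. It is not the HSCS implementation from the thesis: the abstract does not give the hash function, the sampling rate, the window span, or the DBSCAN parameters, so the names `hash_keep`, `FastLayer`, `keep_one_in`, `span`, `eps`, and `min_pts` are all assumptions chosen for illustration, as is the use of MD5 as the hash. The sketch also assumes that timestamps arrive in non-decreasing order.

```python
import hashlib
from collections import deque
from math import dist


def hash_keep(point, keep_one_in):
    """Hash-based sampling: keep roughly one point in `keep_one_in`.

    MD5 is an assumed choice; any well-mixed hash would serve.
    """
    digest = hashlib.md5(repr(point).encode()).digest()
    return int.from_bytes(digest[:4], "big") % keep_one_in == 0


class FastLayer:
    """Online layer: hash-sampled points from the last `span` time units."""

    def __init__(self, span, keep_one_in):
        self.span = span
        self.keep_one_in = keep_one_in
        self.window = deque()  # (timestamp, point), oldest first

    def push(self, timestamp, point):
        # Evict samples that have fallen out of the time window.
        while self.window and self.window[0][0] <= timestamp - self.span:
            self.window.popleft()
        if hash_keep(point, self.keep_one_in):
            self.window.append((timestamp, point))

    def synopsis(self):
        """Sampled points handed to the offline layer."""
        return [point for _, point in self.window]


def dbscan(points, eps, min_pts):
    """Offline layer: plain DBSCAN with a brute-force neighbour search.

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Usage: feed a small synthetic stream, then cluster the current synopsis.
layer = FastLayer(span=100, keep_one_in=2)
for t in range(200):
    layer.push(t, (t % 5, (t * 7) % 5))
cluster_labels = dbscan(layer.synopsis(), eps=1.5, min_pts=3)
```

Because the sampling decision depends only on the hash of the point, the same tuple is always either kept or dropped, and the offline layer stores only the sampled points from the current window instead of the full stream. The brute-force neighbour search is quadratic in the synopsis size; a spatial index would be the usual replacement for larger synopses.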
Keywords/Search Tags: Data stream, Sliding window, Clustering algorithm