
Research On Data Stream Clustering Based On Grid And Density

Posted on: 2010-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
GTID: 2178360275953317
Subject: Computer software and theory
Abstract/Summary:
In recent years, the rapid development of computer and application technology has greatly improved people's ability to obtain data. Data streams are an important type of data source and are attracting increasing attention. Stream data is continuous, ordered, fast-changing, and huge in volume. It is quite a new object, different from traditional static data stored on disk. Data mining on data streams has become a hot research field, and data stream clustering is one of its most active topics.

One goal of this thesis is to design and develop a data stream clustering algorithm that is both accurate and fast. To this end, we have done the following work. The background of and related work on data stream mining are discussed. Popular clustering algorithms are summarized. The characteristics of data streams and the key technical points of data stream clustering are studied. On this basis, we propose the GDE-Stream (Grid and Density based Evolving Stream) algorithm, a framework based on grid and density. By modifying the synopsis data structure, the algorithm gains the following characteristics.

1. Borrowing the framework of the CluStream algorithm, GDE-Stream is divided into an online layer and an offline layer. The online layer reads the data stream rapidly and stores the relevant information in a synopsis data structure. Based on this, the offline layer provides accurate clustering. The two layers work together to balance accuracy and speed.
2. The system preserves the characteristics of the data stream in grids. In addition to summary statistics, grids also record the spatial information of the data stream, which reduces the loss of information.
3. On the online layer, using the spatial information in the synopsis data structure, the Online-Read algorithm compares the distances between a new record and the relevant grids and maps the record to the correct grid, which partly solves the problem of information loss at grid edges.
4. On the offline layer, a density-based clustering algorithm is used, so that the system can handle datasets of arbitrary shape. With the concepts of grid frame and evolution difference, the system can also meet the need for clustering and evolution analysis of historical data streams.

Experiments on both synthetic datasets and a real dataset show that the algorithm is applicable and accurate and can cluster data streams efficiently.
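The two-layer idea described above can be illustrated with a minimal Python sketch. It is not the thesis's GDE-Stream implementation: the abstract does not specify the synopsis fields, the distance rule, or the density criterion, so everything below is an assumption made for illustration. In particular, the names `GridSynopsis`, `online_read`, `offline_cluster`, `cell_size`, and `min_count` are hypothetical, the "spatial information" is assumed to be a per-cell linear sum (giving a centroid), and grid frames and evolution difference are left out entirely.

```python
from collections import deque
from itertools import product
from math import dist, floor


class GridSynopsis:
    """Assumed per-cell summary: a point count plus a linear sum,
    so the centroid of the points in the cell can be recovered."""

    def __init__(self, dim):
        self.count = 0
        self.linear_sum = [0.0] * dim

    def add(self, point):
        self.count += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def centroid(self):
        return [s / self.count for s in self.linear_sum]


def neighbours(cell):
    """Yield every cell whose index differs by at most 1 per dimension."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):  # skip the all-zero offset (the cell itself)
            yield tuple(c + o for c, o in zip(cell, offset))


def online_read(grids, point, cell_size):
    """Online layer (assumed rule): map a record to the nearest candidate cell.

    Candidates are the record's home cell and any adjacent cell that already
    holds data. Distance is measured to the cell's centroid, or to the
    geometric centre when the home cell is still empty.
    """
    home = tuple(floor(x / cell_size) for x in point)
    candidates = [home] + [n for n in neighbours(home) if n in grids]

    def distance_to(cell):
        if cell in grids:
            return dist(point, grids[cell].centroid())
        return dist(point, [(c + 0.5) * cell_size for c in cell])

    target = min(candidates, key=distance_to)
    grids.setdefault(target, GridSynopsis(len(point))).add(point)
    return target


def offline_cluster(grids, min_count):
    """Offline layer (assumed rule): label connected groups of dense cells.

    A cell is dense when it holds at least min_count points; adjacent dense
    cells are merged by breadth-first search, so clusters may take any shape.
    """
    dense = {c for c, g in grids.items() if g.count >= min_count}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for n in neighbours(cell):
                if n in dense and n not in labels:
                    labels[n] = next_label
                    queue.append(n)
        next_label += 1
    return labels
```

In this sketch a caller would feed each arriving record to `online_read` with a shared `grids` dictionary and invoke `offline_cluster` only when a clustering is requested, which mirrors the split between fast ingestion and more careful clustering. A real stream algorithm would also need to age or discard old cells; that part is omitted here.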
Keywords/Search Tags: Data Stream, Clustering, Two-tier Framework, Grid, Density