
Research Of Data Stream Clustering Methods Based On Density

Posted on: 2015-10-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Gao
Full Text: PDF
GTID: 1318330518972867
Subject: Computer application technology
Abstract/Summary:
In recent years, with the development of information technology, the data stream, as a common form of data, has attracted increasing attention from data mining researchers. In contrast to static data, which is stored on media that support random access, a data stream is continuous, real-time, and sequential, which makes traditional methods inapplicable. Researchers have done a great deal of work on the data stream clustering problem and have proposed many fast clustering algorithms that provide people with a growing amount of valuable information for decision making. However, because of the complexity and diversity of data streams, these algorithms need to be improved to meet new conditions and demands. Many problems remain to be studied and resolved, such as improving clustering accuracy, finding clusters of different densities and outliers, and finding clusters of different shapes in distributed or uncertain data streams. This paper addresses the clustering analysis task on data streams using density-based clustering techniques and studies the following four aspects in depth.

First, algorithms for clustering uncertain data streams are mostly based on partitioning, which makes it difficult to find clusters of arbitrary shape. In addition, existing density-based algorithms cannot handle attribute-level uncertainty. We propose an expected distance criterion to measure the uncertainty of grids; it analyzes the impact of attribute-level uncertainty on clustering and considers two factors: the number of points in a grid and the uncertainty of the grid. At the same time, we define a new density threshold and a grid fading standard, then classify the grids and design a clustering algorithm that captures changes in the clusters. Combined with the fading window technique, we propose a grid density-based uncertain data stream clustering algorithm (DBUSC), which finds neighboring grids whose density exceeds the dynamic density threshold to obtain the clustering result. Experiments show that, compared with conventional distance-based methods, DBUSC can find non-spherical clusters, does not need the number of clusters to be specified, and achieves better cluster quality in less time.

Second, the micro-clusters adopted in micro-cluster-based stream algorithms do not preserve the information of the data stream, which affects clustering accuracy, and the two-phase method reduces the efficiency of the algorithm. We propose the representative point structure as a synopsis to store the density information of the data stream, and we define the circle point, obtaining clusters by iteratively searching circle points to find connected representative points. In addition, by defining the temporal weight of a representative point, we propose the representative point-based data stream clustering algorithm (RB-Stream), which uses a test-update strategy to find representative points whose weight is below the threshold or is increasing. The algorithm improves efficiency while detecting both newly emerging and disappearing clusters. Experiments show that, compared with micro-cluster-based algorithms, RB-Stream achieves better clustering accuracy in less time.

Third, existing density-based stream clustering algorithms mostly apply to streams with constant density and cannot find clusters of different densities. Furthermore, as data flows in, it is difficult to discover changing clusters and outliers. On the basis of the shared nearest neighbor (SNN) graph, we define the SNN density, which considers both the degree to which a data object is surrounded by its nearest neighbors and the degree to which it is demanded by surrounding data objects. As a result, the clustering result is robust to variation in density. In addition, we define the average distance of a data object and the cluster density to identify outliers and clusters connected by bridges. We then maintain and update the clusters on the shared nearest neighbor graph over a sliding window and propose the SNN density-based data stream clustering algorithm (SNDStream), which searches for connected components in the SNN graph to obtain the result. Experiments show that SNDStream can find clusters of arbitrary shape and different densities and can correctly identify outliers and chains between clusters. The algorithm achieves better cluster quality and is suitable for changing clusters without the number of clusters being specified.

Finally, it is important to find clusters of arbitrary shape in distributed data stream environments, but existing distributed stream clustering algorithms, which are based on distance or models, cannot handle non-spherical clusters well. We propose the distributed data stream clustering algorithm (RB-DDSC). The algorithm has two phases: first, a local model based on representative points is generated at each remote site and sent to the coordinator site; then, global clusters are generated at the coordinator site by combining the local models. Furthermore, we design a test-update local model algorithm that avoids frequently sending data when the data stream is stable, which reduces data transmission. Experiments show that RB-DDSC can find clusters of arbitrary shape in distributed data streams and reduces data transmission by using representative points and the update strategy.
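The general idea behind grid density-based stream clustering with a fading window can be illustrated with a minimal sketch. This is not the DBUSC algorithm from the dissertation: it ignores attribute-level uncertainty, uses a fixed rather than dynamic density threshold, and works only in two dimensions. The function name `grid_cluster` and the parameters `cell`, `decay`, and `threshold` are hypothetical names chosen for illustration.

```python
from collections import deque


def grid_cluster(points, cell=1.0, decay=0.98, threshold=2.0):
    """Cluster a 2-D stream by connecting neighboring dense grid cells.

    points: iterable of (t, x, y) tuples with non-decreasing timestamps t.
    Returns a dict mapping each dense cell (ix, iy) to a cluster id.
    """
    density = {}  # cell -> (faded density, time of last update)
    now = 0
    for t, x, y in points:
        now = t
        key = (int(x // cell), int(y // cell))
        d, last = density.get(key, (0.0, t))
        # Fade the old density by the elapsed time, then count the new point.
        density[key] = (d * decay ** (t - last) + 1.0, t)

    # Fade every cell to the current time and keep those above the threshold.
    dense = {
        key
        for key, (d, last) in density.items()
        if d * decay ** (now - last) >= threshold
    }

    # Breadth-first search over the 8-neighborhood joins adjacent dense cells.
    labels = {}
    next_id = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in labels:
                        labels[nb] = next_id
                        queue.append(nb)
        next_id += 1
    return labels
```

Because clusters are formed from connected dense cells rather than from distances to a center, a sketch like this needs no preset number of clusters and can follow non-spherical shapes, and the decay factor lets cells that stop receiving data drop out of the result over time.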
Keywords/Search Tags: Data mining, Data stream, Clustering analysis, Density, Representative point