Font Size: a A A

Research On Clustering Technique Based On Self-similarity And Grid Over Data Stream

Posted on:2013-02-17Degree:MasterType:Thesis
Country:ChinaCandidate:X J QiFull Text:PDF
GTID:2218330362963215Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In data mining field, the data stream clustering analysis has become an importantresearch. In the algorithm of clustering over data stream, the traditional clusteringalgorithm has its obvious limitations, which promotes researchers to improve the existingdata mining algorithms and propose the new algorithms. In this paper, the fractal theory isused, the clustering algorithms over data stream based on fractal self-similarity areproposed. Meanwhile, the outlier mining algorithm based on fractal clustering is offered.First of all, a data stream clustering algorithm based on the fractal self-similarity isproposed. To get the fractal dimension, the nested grid data structure is applied, thestatistics of data set can be got by scanning data set once. Then, based on the single fractalinfluence degree, the average fractal influence degree is presented. For each cluster, afterone point adding into, the fractal influence degree is not same. According to the fractalinfluence degree, the data points will be dynamically put into the cluster whose fractalinfluence degree is minimal.Secondly, a high dimensional data stream clustering algorithm based on fractal andgrid is offered. In this algorithm, all the high dimensional data will be stored in the girds.While the request of clustering comes, the feature attributes of high dimensional data canbe got by the fractal method. The fractal dimension of all data set is will be treated as thenumber of the feature attributes. Then, the high dimensional data will be projected into thesubspace formed by the feature attributes. Meanwhile, in the subspace, the grid will be asdata unit, the grid unit will be clustered by fractal method.Finally, an outlier mining algorithm based on fractal clustering over data stream isproposed. In this method, the data space will be divided into many non-intersect grids.The grids will be clustered using the grid based algorithm. The adjacent dense grids willbe regarded as clusters. However, the sparse grids, which are not clustered, will be put intothe existing clusters, and then the outlier degree can be calculated according to the fractaldimensions. If the outlier degree of one grid is greater than a predefined threshold, thedata points in sparse grid are outliers. Otherwise, the sparse grid may be developed to thedense grid, so the grid will be added into the cluster which has the minimal outlier degree. The above algorithms are implemented with C++language. Experimental resultsshow that these algorithms proposed in this paper obtain the higher cluster quality than thecurrent ones, and the anticipated results are realized.
Keywords/Search Tags:data stream, cluster, fractal, grid, outlier
PDF Full Text Request
Related items