Research On Clustering Algorithm Over High Dimensional Data Stream Based On Grid And Sequence Data

Posted on: 2011-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: R X Yao
Full Text: PDF
GTID: 2178330338991002
Subject: Computer application technology
Abstract/Summary:
Clustering over data streams currently plays an important role in data mining. Existing grid-based clustering algorithms are efficient, but their cluster quality depends directly on the grid granularity, and they cannot handle high-dimensional data streams. To address these problems, this thesis focuses on improving the cluster quality of grid- and density-based algorithms over data streams and on clustering high-dimensional data streams. Both are important data mining problems with broad applications, including network security, wireless sensors, and industrial control.

First, an irregular grid-based clustering algorithm over high-dimensional data streams is developed. An irregular grid structure is generated and dynamically maintained by splitting each dimension into different grid cells. When a clustering request arrives, the final clusters are obtained in the subspaces formed by the dimensions associated with the corresponding clusters.

Second, a clustering algorithm based on a grid and a matrix over high-dimensional data streams is proposed. The algorithm adopts a two-phase framework. In the online component, GCs are used to monitor the one-dimensional statistical data distribution of each dimension independently. Sparse GCs, which need to be deleted, are identified using a predefined threshold. In the offline component, a grid matrix structure is generated from the remaining dense GCs. When a clustering request arrives, the final multi-dimensional clusters are obtained by traversing the whole data space with pointers.

Finally, a new similarity method and a sequence clustering algorithm are proposed. The number of common sequence elements contained in fault feature sequences is calculated to measure the relationship among sequences. The similarity method also monitors the degree of normalization of the fault feature sequences to produce more accurate clustering results.
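As a rough illustration of a similarity measure built on common sequence elements, the sketch below counts the distinct elements shared by two fault feature sequences and divides by the number of distinct elements in either one. The abstract does not give the actual formula, so the class name `SequenceSimilarity`, the string element type, and the Jaccard-style normalization are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not the similarity method defined in the thesis.
final class SequenceSimilarity {

    /** Number of distinct elements that occur in both sequences. */
    static int commonElements(List<String> a, List<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(new HashSet<>(b));
        return common.size();
    }

    /**
     * Common-element count divided by the number of distinct elements in
     * either sequence (an assumed normalization). Result lies in [0, 1].
     */
    static double similarity(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // two empty sequences: treated as dissimilar by convention
        }
        return (double) commonElements(a, b) / union.size();
    }

    public static void main(String[] args) {
        // Made-up fault feature sequences, for demonstration only.
        List<String> s1 = Arrays.asList("open", "read", "timeout", "close");
        List<String> s2 = Arrays.asList("open", "write", "timeout", "retry");
        System.out.println(commonElements(s1, s2));
        System.out.println(similarity(s1, s2));
    }
}
```

Normalizing by the union keeps long sequences from scoring high merely because they contain many elements; other normalizations would fit the same outline.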
In the clustering stage, micro-clusters are merged into k macro-clusters, as required by the user, according to the similarity metric of the micro-clusters. Clustering software fault features in this way narrows the range over which fault features must be matched.

The above algorithms are implemented in Java. Experimental results show that the algorithms proposed in this thesis achieve higher cluster quality than existing ones, and the anticipated results are realized.
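The online component of the second algorithm, which tracks each dimension independently and deletes sparse GCs by threshold, can be sketched roughly as follows. This is a simplified illustration and not the thesis implementation: it assumes a GC is a one-dimensional cell holding only a point count, uses a fixed cell width instead of an irregular grid, and applies no time decay. The names `OneDimensionalGrid`, `insert`, `pruneSparse`, and `cells` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-dimension cell statistics; assumptions noted above.
final class OneDimensionalGrid {
    private final double cellWidth; // assumed uniform width for every dimension
    private final List<Map<Long, Integer>> cellsPerDimension = new ArrayList<>();

    OneDimensionalGrid(int dimensions, double cellWidth) {
        this.cellWidth = cellWidth;
        for (int d = 0; d < dimensions; d++) {
            cellsPerDimension.add(new HashMap<>());
        }
    }

    /** Online step: for each dimension, increment the count of the cell the point falls in. */
    void insert(double[] point) {
        if (point.length != cellsPerDimension.size()) {
            throw new IllegalArgumentException("unexpected point dimensionality");
        }
        for (int d = 0; d < point.length; d++) {
            long cell = (long) Math.floor(point[d] / cellWidth);
            cellsPerDimension.get(d).merge(cell, 1, Integer::sum);
        }
    }

    /** Deletes every cell whose count is below the predefined threshold. */
    void pruneSparse(int minCount) {
        for (Map<Long, Integer> cells : cellsPerDimension) {
            cells.values().removeIf(count -> count < minCount);
        }
    }

    /** Surviving (dense) cells of one dimension, keyed by cell index. */
    Map<Long, Integer> cells(int dimension) {
        return cellsPerDimension.get(dimension);
    }
}
```

Keeping statistics per dimension means memory grows with the number of dimensions rather than with the number of multi-dimensional cells. An offline step could then combine the surviving dense cells across dimensions.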
Keywords/Search Tags: clustering analysis, high dimension, irregular grid, grid matrix, sequence data