Research On Clustering Algorithm Over High Dimensional Data Stream Based On Grid And Sequence Data

Posted on:2011-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:R X Yao

GTID:2178330338991002

Subject:Computer application technology

Abstract/Summary:

At present, clustering over data streams plays an important role in data mining. Existing grid-based clustering algorithms are efficient, but their cluster quality depends directly on the grid granularity, and they cannot handle high-dimensional data streams. To address these problems, this thesis focuses on improving the cluster quality of grid- and density-based algorithms over data streams and on clustering high-dimensional data streams. Both are important data mining problems with broad applications, including network security, wireless sensors, and industrial control.

First, an irregular grid-based clustering algorithm for high-dimensional data streams is developed. An irregular grid structure is generated and dynamically maintained by splitting each dimension into different grid cells. When a clustering request arrives, the final clusters are obtained in subspaces formed by the dimensions associated with the corresponding clusters.

Second, a clustering algorithm based on a grid and a matrix is proposed for high-dimensional data streams. The algorithm adopts a two-phase framework. In the online component, GCs are used to monitor the one-dimensional statistical data distribution of each dimension independently. Sparse GCs, which need to be deleted, are identified by a predefined threshold. In the offline component, a grid matrix structure is generated from the remaining dense GCs. When a clustering request arrives, the final multi-dimensional clusters are obtained by traversing the whole data space with pointers.

Finally, a new similarity measure and a sequence clustering algorithm are proposed. The number of common elements shared by fault feature sequences is counted to measure the relationship among sequences. The similarity measure also monitors the degree of normalization of the fault feature sequences to produce more accurate clustering results. In the clustering stage, micro-clusters are merged into k macro-clusters, as required by the user, according to a similarity metric between micro-clusters. Clustering software fault features in this way narrows the range over which fault features have to be matched.

The algorithms above are implemented in Java. Experimental results show that the proposed algorithms achieve higher cluster quality than existing ones and that the anticipated results are achieved.
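The online component described in the abstract (independent one-dimensional statistics per dimension, with sparse cells deleted by a threshold) can be illustrated with a minimal Java sketch. This is not the thesis's implementation: the class name `OneDimGridSketch`, the uniform cell width, and the use of a raw point count as the sparsity threshold are all assumptions made for illustration, and the offline grid-matrix step is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: per-dimension cell counters with threshold pruning. */
final class OneDimGridSketch {
    private final double cellWidth; // assumption: every dimension uses the same fixed cell width
    private final long minCount;    // assumption: a cell is "sparse" if its count is below this
    private final List<Map<Long, Long>> cellsPerDim = new ArrayList<>();

    OneDimGridSketch(int dimensions, double cellWidth, long minCount) {
        this.cellWidth = cellWidth;
        this.minCount = minCount;
        for (int d = 0; d < dimensions; d++) {
            cellsPerDim.add(new HashMap<>());
        }
    }

    /** Online step: update the one-dimensional counter of each dimension independently. */
    void insert(double[] point) {
        for (int d = 0; d < cellsPerDim.size(); d++) {
            long cell = (long) Math.floor(point[d] / cellWidth);
            cellsPerDim.get(d).merge(cell, 1L, Long::sum);
        }
    }

    /** Delete every cell whose count is below the threshold. */
    void pruneSparseCells() {
        for (Map<Long, Long> cells : cellsPerDim) {
            cells.values().removeIf(count -> count < minCount);
        }
    }

    /** Sorted indices of the cells currently kept for one dimension. */
    TreeSet<Long> cells(int dimension) {
        return new TreeSet<>(cellsPerDim.get(dimension).keySet());
    }

    public static void main(String[] args) {
        OneDimGridSketch grid = new OneDimGridSketch(2, 1.0, 2);
        grid.insert(new double[] {0.2, 5.1});
        grid.insert(new double[] {0.7, 5.9});
        grid.insert(new double[] {3.4, 5.5});
        grid.pruneSparseCells();
        System.out.println(grid.cells(0)); // prints [0]
        System.out.println(grid.cells(1)); // prints [5]
    }
}
```

Keeping one counter map per dimension means memory grows with the number of occupied one-dimensional cells rather than with the number of occupied multi-dimensional cells, which is one plausible reason for monitoring each dimension separately on a high-dimensional stream.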

Keywords/Search Tags:

clustering analysis, high dimension, irregular grid, grid matrix, sequence data

Related items

1	Research On Clustering Algorithm Based On Irregular Grid And Subspace Of Descending Dimension
2	Research On Clustering Algorithm Over High Dimensional Data Stream Based On Irregular Grid Data
3	Research On Data Mining In The Scientific Data Grid
4	Grid-based Clustering Algorithm With Referential Values Of Parameters
5	An Incremental Grid Clustering Algorithm Based On Density-dimension-tree
6	Research On Clustering Algorithm For Heterogeneous Objects Based On Information Dissimilarity And Irregular Grid
7	Research And Application Of Text-related Sentiment Analysis Based On Grid Tags
8	Study On Grid-based Clustering Algorithms
9	Research On Data Stream Clustering Algorithm Based On Similarity And Grid Partition Optimization
10	Research On Data Stream Clustering Algorithm Based On Density Grid