Research Of Data Stream Clustering Methods Based On Grid

Posted on:2011-12-27

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Yu

Full Text:PDF

GTID:1118330332460588

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

In recent years, the technology of data collection and data mining developed continuously, then we can get massive data to process and analyze in very short time. With the rapid development of information technology and Web technology, data are no longer static data stored in the media which can be visited several times, but the flow-type data called data stream. Other than static data, data stream has the characteristics of real-time property, continuity property, succession property, etc. Traditional clustering analysis technology can not process data stream directly, and new technology is required to process data stream. This paper makes deep and detailed study on the clustering technology of data stream from several scales.First, we analyze the advantages and disadvantages of the clustering algorithms based on grid, then further make the study on traditional static grid partition methods and dynamic ones. We propose a new dynamic partition strategy of data space which improves the partition methods in clustering algorithms based on grid and updates the structure of grid cells and the statistics information. On the basis of these, we propose the clustering algorithm based on dynamic grid partition which has high effectiveness as traditional clustering algorithm based on grid does and improve the clustering quality in a way.Second, based on the new proposed dynamic partition strategy, we then put focus on the study of data stream incremental clustering. Have studied the characteristics of the current data stream clustering algorithms, current incremental clustering algorithm and the existing disadvantages, we plan to design a new algorithm which is an incremental algorithm based on irregular grid for clustering data stream to process the disadvantages of current clustering algorithm's ineffectiveness of clustering spherical clusters and to meet the requirements of real-time property, etc. Compared to other algorithms, the proposed algorithm should have the advantage of high efficiency, meanwhile, it adjust the grid structure continuously and dynamicly. The new proposed algorithm judges whether grid cells are connected, and guarantees the clustering results of different shape of clusters. When clustering, it needn't to provide the number of clusters and it should not be sensitive to outlier. Generally, the grid cells contains outliers can not exceed the threshold of density and can be removed by pruning strategy to reduce the complexity of algorithm which won't influence the process of clustering. The dynamic partition strategy of grid cells reflects the current distribution of data stream, and improves the clustering precision to some extent.Further, based on the analysis of high dimensional clustering methods, dimension reduction methods and the their application in data stream, we propose a new algorithm, it is an incremental algorithm based on rough reduction for clustering data stream to solve the problem of data sparsity and the skewness of attribute importance. The proposed algorithm should compute the importance of each attribute by attribute reduction method without decision attribute, then update attribute set. It adds the attributes with high importance to the attribute set, and removes those with low importance. With the guarantee of precision, it should adjust the attribute set and improves the algorithm efficiency.Finally, based on the analysis of current data stream subspace clustering algorithms, we propose a new algorithm, it is a data stream subspace clustering algorithm whose aim is to solve the inefficiency of current data stream subspace clustering algorithms. The proposed algorithm should use a bottom-up strategy and partitions data space into domains on each dimension with fully consideration of the distribution of data points, then find subspaces by overlapping domains and further generates clusters. It should have high efficiency and should not be sensitive to outliers and can find clusters in subspaces from high dimensional data stream effectively. In addition, partition the domains repeatedly according to the changes of data stream which can reflect the status of data stream.This dissertation is mainly carried on data stream clustering. And the validity, efficiency of each new method is demonstrated by simulate experiments. With the research on data mining technology, artificial intelligence and rough set theory involved, it provides a good theoretical basis and idea for future research.

Keywords/Search Tags:

data stream, grid, incremental clustering, dynamic partition, reduction

PDF Full Text Request

Related items

1	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
2	The Research Of Grid-based Parallel Clustering Algorithm And Clustering For Data Stream
3	Research On Data Stram Clustering Algorithm Based On Similarity And Grid Partition Optimization
4	Research On Algorithm And Application Of Dynamic Grid-based Clustering Over Data Stream
5	Research On Data Stream Clustering Algorithm Based On Sliding Windows And Subspace Partition
6	Research On Incremental Clusteirng Algoirthm Based On PDStream
7	Research On Dynamic Measurement Based Data Stream Clustering And Its Applications
8	An Incremental Grid Clustering Algorithm Based On Density-dimension-tree
9	Research On Data Stream Clustering Based On Grid And Density
10	Research On Dynamic Incremental Clustering Algorithm