Font Size: a A A

Research Of Data Stream Clustering Methods Based On Grid

Posted on:2011-12-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:X YuFull Text:PDF
GTID:1118330332460588Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In recent years, the technology of data collection and data mining developed continuously, then we can get massive data to process and analyze in very short time. With the rapid development of information technology and Web technology, data are no longer static data stored in the media which can be visited several times, but the flow-type data called data stream. Other than static data, data stream has the characteristics of real-time property, continuity property, succession property, etc. Traditional clustering analysis technology can not process data stream directly, and new technology is required to process data stream. This paper makes deep and detailed study on the clustering technology of data stream from several scales.First, we analyze the advantages and disadvantages of the clustering algorithms based on grid, then further make the study on traditional static grid partition methods and dynamic ones. We propose a new dynamic partition strategy of data space which improves the partition methods in clustering algorithms based on grid and updates the structure of grid cells and the statistics information. On the basis of these, we propose the clustering algorithm based on dynamic grid partition which has high effectiveness as traditional clustering algorithm based on grid does and improve the clustering quality in a way.Second, based on the new proposed dynamic partition strategy, we then put focus on the study of data stream incremental clustering. Have studied the characteristics of the current data stream clustering algorithms, current incremental clustering algorithm and the existing disadvantages, we plan to design a new algorithm which is an incremental algorithm based on irregular grid for clustering data stream to process the disadvantages of current clustering algorithm's ineffectiveness of clustering spherical clusters and to meet the requirements of real-time property, etc. Compared to other algorithms, the proposed algorithm should have the advantage of high efficiency, meanwhile, it adjust the grid structure continuously and dynamicly. The new proposed algorithm judges whether grid cells are connected, and guarantees the clustering results of different shape of clusters. When clustering, it needn't to provide the number of clusters and it should not be sensitive to outlier. Generally, the grid cells contains outliers can not exceed the threshold of density and can be removed by pruning strategy to reduce the complexity of algorithm which won't influence the process of clustering. The dynamic partition strategy of grid cells reflects the current distribution of data stream, and improves the clustering precision to some extent.Further, based on the analysis of high dimensional clustering methods, dimension reduction methods and the their application in data stream, we propose a new algorithm, it is an incremental algorithm based on rough reduction for clustering data stream to solve the problem of data sparsity and the skewness of attribute importance. The proposed algorithm should compute the importance of each attribute by attribute reduction method without decision attribute, then update attribute set. It adds the attributes with high importance to the attribute set, and removes those with low importance. With the guarantee of precision, it should adjust the attribute set and improves the algorithm efficiency.Finally, based on the analysis of current data stream subspace clustering algorithms, we propose a new algorithm, it is a data stream subspace clustering algorithm whose aim is to solve the inefficiency of current data stream subspace clustering algorithms. The proposed algorithm should use a bottom-up strategy and partitions data space into domains on each dimension with fully consideration of the distribution of data points, then find subspaces by overlapping domains and further generates clusters. It should have high efficiency and should not be sensitive to outliers and can find clusters in subspaces from high dimensional data stream effectively. In addition, partition the domains repeatedly according to the changes of data stream which can reflect the status of data stream.This dissertation is mainly carried on data stream clustering. And the validity, efficiency of each new method is demonstrated by simulate experiments. With the research on data mining technology, artificial intelligence and rough set theory involved, it provides a good theoretical basis and idea for future research.
Keywords/Search Tags:data stream, grid, incremental clustering, dynamic partition, reduction
PDF Full Text Request
Related items