Font Size: a A A

Research On Partition-Based Efficient Clustering Algorithm For Large-Scale Data

Posted on:2011-05-22Degree:MasterType:Thesis
Country:ChinaCandidate:L L MengFull Text:PDF
GTID:2178360302494484Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
As one of partitioning-based clustering algorithm, the k-means algorithm is widely used in various aspects. However, these algorithms can not cluster effectively on large databases. This paper has mainly concerned on how to incorporate grid-based methods constraints in the k-means algorithm, how to solve some problems in grid method, how to improve k-modes clustering algorithm and how to apply it to software security detecting. The research on these problems is important with broad applications, including pattern recognition, data analysis, market research and other cluster analysis of related processing, and so on.First of all, this paper presents an improved clustering algorithm based on grid-density, CABGD. In CABGD, we put forward a concept of Center Intensity of Grid Cell, CIGC. The distribution of data in some grid is identified by calculating the value of CIGC of the grid. CABGD solves the problem of low clustering accuracy that the division of grid by human leads to in the traditional grid-based algorithm. The results show that CABGD has higher accuracy than traditional grid-based clustering algorithm.Secondly, an improved k-means clustering algorithm based on grid, IKMG, is presented. Every single grid is regard as a basic processing unit in IKMG. First, the algorithm further improve the concept of CIGC from second chapter, and design a new data structure—clustering tree. The clustering tree is used to organize and adjust the cluster. The process of clustering is the process of constructing and merging the tree. Finally the clustering results is k cluster trees. For large-scale data, the algorithm has better time efficiency than traditional k-means algorithm, can detect clusters of arbitrary shapes and sizes and do not have to specify the k value at the beginning.Finally, the paper designs a data structure—similar feature tree, SFT. As a software fault detection tool, SFT is mainly used to improve software security detection efficiently. In order to construct a SFT, an improved k-modes clustering algorithm, IKMD, is presented. In IKMD, the choice of initial modes is carried out in the entire clustering process. The clustering results are given in the form of k trees. Finally, the forests containing k trees are conversed to a binary tree , that is the SFT. According to the character of apriori, SFT detect fault of software program by the principles of the first left and then right. The results showed that the method is feasible and effective.
Keywords/Search Tags:partitioning clustering, clustering, k-means, grid, k-modes
PDF Full Text Request
Related items