Font Size: a A A

Grid-based Clustering Algorithm With Referential Values Of Parameters

Posted on:2009-04-15Degree:MasterType:Thesis
Country:ChinaCandidate:X D YiFull Text:PDF
GTID:2178360242990837Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Clustering that has a wide range of applications is an important research field in data mining. Research on clustering algorithm is a difficult work. At present, many popular clustering algorithms such as k-means,k-medoids,BIRCH,CURE,DBSCAN,and STING have been applied successfully in many fields, but many new challenges are emerging, such as the blindness of parameter assignment, the handling of large-scale datasets and high-dimensional datasets, etc. Some parameters of most clustering algorithms must be assigned manually by users in the process of clustering. If the value of those parameters is not appropriate, the results of clustering will be very bad. So the assignment of parameters is a hard work, especially for general users. It usually brings about the blindness of parameter assignment.For grid-based clustering algorithm, density threshold is a key parameter. To reduce the blindness of density threshold, a new kind of clustering algorithm called GRPC was presented in this thesis. GRPC is abbreviated from a grid-based clustering algorithm with referential value of parameters, which is based on grid algorithm and adopts density thinking and data distribution evolvement thinking。Density threshold formula is proposed according to data distribution evolvement. And some different density thresholds are computed out by means of Density threshold formula. With the help of these density thresholds, it can not only cluster general data but also segregate high-density clusters from low-density clusters.Algorithm analysis indicates that the time complexity of GRPC is less than DBSCAN's. Simulation shows that this new algorithm can differentiate between outliers or noises and clusters effectively, discover clusters of arbitrary shapes, with good clustering quality and cluster different clusters whose level of clustering is not same each other.The further work of this thesis is to develop GRPC algorithm in the aspects of scalability and high-dimensionality. Two-dimensional subspace clustering method is used to cluster high-dimensional data, which transforms the clustering of high-dimensional data space into the clustering of two-dimensional data subspaces. Simple random technology is applied to sample large-scale data set. After clustering sample data set, non-sample data set is clustered by means of the clusters of sample data set. The extended GRPC can cluster not only small-scale data set but also large-scale data set and cluster not only two-dimensional data but also high-dimensional data. The time complexity of the extended GRPC algorithm is not only related with sampling rate and the number of data points of data set but also related with the square of dimensionality. Simulations show that the development of GRPC algorithm is entirely feasible.
Keywords/Search Tags:DataMing, Clustering, Density threshold, Grid, High dimension
PDF Full Text Request
Related items