The Study Of Application And Analysis About Clustering Algorithm In Data Mining

Posted on: 2003-01-27 | Degree: Master | Type: Thesis
Country: China | Candidate: H Y Zheng | Full Text: PDF
GTID: 2168360092466059 | Subject: Computer system architecture
Abstract/Summary:
Clustering is an important method for partitioning or grouping data, and it is applied in many fields, including data mining. It has been used in commerce, market analysis, biology, Web classification, and other areas. Existing clustering algorithms fall into five kinds: partitional, hierarchical, density-based, grid-based, and model-based. These algorithms have many disadvantages: some work only on numeric values, some are inefficient, some are sensitive to initial starting conditions or to the order of data input, some cannot guarantee the best solution, and some rely on input parameters.

DBSCAN is a density-based clustering algorithm that can efficiently discover clusters of arbitrary shape and can effectively handle noise. However, it has two disadvantages that need to be overcome. One is that it requires a large volume of memory, especially when dealing with large-scale databases. The other is that it requires a global parameter Eps to be determined; if Eps is not appropriate, clustering quality is reduced, especially when cluster density and the distances between clusters are uneven. In this thesis I apply the idea of divide and conquer. Before clustering, I divide all the data into partitions called grids, process each grid on a different CPU, and finally combine the results from the different CPUs. This reduces the volume of memory required. In addition, when cluster density and the distances between clusters are uneven, a global Eps no longer affects clustering quality, because each grid determines its own Eps. Experiments demonstrate that both the memory requirement and the dependence on Eps are reduced.

K-means is a partitioning algorithm that constructs a partition of a database of n objects into a set of K clusters, where K is an input parameter.
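The grid-based divide-and-conquer scheme for DBSCAN described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say how each grid chooses its Eps, so the `local_eps` heuristic (mean distance to the k-th nearest neighbour within the cell), the names `grid_dbscan` and `cell_size`, and the 2-D point format are all assumptions. The cells are processed in a plain loop here, although each one is independent and could be handed to a separate CPU, and the step that merges clusters spanning cell boundaries is omitted.

```python
import math
from collections import defaultdict


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels


def local_eps(points, k):
    """Assumed heuristic (not from the thesis): mean distance to the k-th nearest neighbour."""
    if len(points) <= k:
        return 0.0
    kth = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        kth.append(dists[k])  # dists[0] is the distance from p to itself
    return sum(kth) / len(kth)


def grid_dbscan(points, cell_size, min_pts):
    """Run DBSCAN separately in each grid cell, each with its own Eps."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    labels = [-1] * len(points)
    next_label = 0
    for cell in sorted(cells):  # each iteration is independent of the others
        idxs = cells[cell]
        local = [points[i] for i in idxs]
        eps = local_eps(local, min_pts - 1)
        local_labels = dbscan(local, eps, min_pts)
        for i, lab in zip(idxs, local_labels):
            if lab != -1:
                labels[i] = next_label + lab  # make cluster ids globally unique
        next_label += max(local_labels, default=-1) + 1
    # Merging clusters that cross cell boundaries is omitted in this sketch.
    return labels
```

With two clusters of very different density that fall into different cells, each cell picks its own Eps and both clusters are recovered, whereas a single global Eps tuned to the dense cluster would label the sparse one as noise.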
K-means uses an iterative procedure: when the algorithm converges to one of numerous local minima, it terminates and outputs the result. The output is therefore especially sensitive to the initial starting conditions, because the K initial starting points are selected at random and a poor selection leads to bad solutions. The quality of the clustering thus relies on the initial starting points. In this thesis I analyse the method of random selection and propose a method of searching for initial starting points by aiming at the target many times. I demonstrate that the improved K-means algorithm obtains better solutions and is less sensitive to the initial starting points. Finally, on the basis of this work, I used the improved K-means algorithm to cluster the data of a Chongqing medicinal company and demonstrated that the improved algorithm is effective and correct.
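The sensitivity of K-means to its starting points can be seen in a small Python sketch. The abstract does not spell out what "aiming at the target many times" involves, so the code below swaps in a simple multi-restart selection as one possible reading: run K-means from several random sets of starting points and keep the run with the lowest sum of squared errors. The function names `kmeans` and `kmeans_best_of`, and the `trials` and `seed` parameters, are hypothetical.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))


def kmeans(points, centers, max_iter=100):
    """Lloyd's iteration from the given starting centers; stops at a local minimum."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    labels = assign(points, centers)
    return centers, labels, sse(points, centers, labels)


def kmeans_best_of(points, k, trials=10, seed=0):
    """Assumed selection scheme: try several random starts, keep the lowest-SSE run."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        init = rng.sample(points, k)  # k distinct data points as starting centers
        result = kmeans(points, init)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

For the four points (0, 0), (0, 1), (4, 0), (4, 1) with K = 2, starting from (0, 0) and (0, 1) converges to a poor local minimum that splits the data into top and bottom rows, while starting from one point on each side yields the left/right split with a much lower error. Repeating the random start several times makes it very likely that the better solution is found.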
Keywords/Search Tags: Data Mining, Cluster, Algorithm, DBSCAN, K-means