
Research on Clustering Methods Based on Divide and Conquer

Posted on: 2012-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: J F Jia
GTID: 2218330368489238
Subject: Systems Engineering
Abstract/Summary:
Cluster analysis is an important research field in data mining, and its concepts, methods, and tools are widely used in practice, for example in financial fraud detection, medical diagnosis, image processing, information retrieval, and the biological sciences. In recent years, clustering algorithms have become a very popular field of study and have achieved fruitful results. However, with the continuous development of science and technology and the constant growth of data volumes, categorical data and mixed data have emerged, and research is no longer limited to numerical data. These two newer kinds of data are often high-dimensional and large in volume, with sparse distributions and more noise. When the dimension is very high, a "tending-to-zero phenomenon" may also appear: the difference between the distances from a given data point to its farthest and nearest points gradually decreases as the dimension increases. Clustering algorithms designed for numerical data cannot easily be applied to categorical data, which lacks an inherent geometric model. Clustering algorithms for categorical data have therefore become an important research topic and have attracted wide attention.

Within the framework of the fuzzy K-Means and fuzzy K-Modes clustering algorithms, this thesis introduces the divide-and-conquer idea to study clustering algorithms for large data sets and for categorical data. The research results are as follows:

(1) The clustering method for large-scale data sets based on divide and conquer divides the data set into several subsets, clusters each subset simultaneously, and then merges the clustering results of the subsets to obtain the final clustering result. This method overcomes the "tending-to-zero phenomenon" that the large volume and high dimensionality of large-scale data may create. In addition, decomposing the large-scale data into small-scale subsets reduces the complexity of clustering. The method was tested on artificial data sets, and the experimental results show that it is effective.

(2) The clustering method for categorical data sets based on divide and conquer applies divide and conquer to the fuzzy K-Modes clustering algorithm: a large, complex data set is divided into several smaller subsets, each subset is clustered, and the clustering results of the subsets are then integrated to obtain the final clustering result. Using the simple 0-1 matching similarity measure, this method overcomes the lack of a geometric model in categorical data, and it avoids the "tending-to-zero phenomenon" caused by large-scale data sets. The method was compared with the traditional fuzzy K-Means and fuzzy K-Modes clustering algorithms on UCI data sets, and the experimental results show that it is effective.

In summary, the thesis proposes clustering algorithms based on divide and conquer and demonstrates their effectiveness on UCI data sets.
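The divide, cluster, and merge steps described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's algorithm: it uses hard (crisp) K-Modes instead of fuzzy K-Modes, splits the data by a fixed stride, initializes modes with the first k distinct records, and merges by clustering the subset modes again. The abstract does not specify the merge rule, so that step is a guess. All function names and the toy records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple 0-1 matching: count the attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(record, modes):
    """Index of the mode closest to record (ties go to the lowest index)."""
    return min(range(len(modes)),
               key=lambda i: matching_dissimilarity(record, modes[i]))


def k_modes(records, k, max_iter=20):
    """Hard K-Modes (assumed simplification of the fuzzy variant).

    Initial modes are the first k distinct records, which keeps the
    sketch deterministic; real implementations usually randomize this.
    """
    distinct = list(dict.fromkeys(records))
    if len(distinct) < k:
        raise ValueError("need at least k distinct records")
    modes = distinct[:k]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for r in records:
            clusters[nearest(r, modes)].append(r)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:
                new_modes.append(modes[j])  # keep the old mode if a cluster empties
                continue
            # New mode: the most frequent value of each attribute.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    return modes


def divide_and_conquer_k_modes(records, k, n_subsets=3):
    # Divide: a fixed-stride split (an assumption; any partition would do).
    subsets = [records[i::n_subsets] for i in range(n_subsets)]
    # Conquer: cluster each subset independently (these calls could run in parallel).
    local_modes = [m for s in subsets for m in k_modes(s, k)]
    # Merge (assumed rule): cluster the subset modes to get k global modes.
    global_modes = k_modes(local_modes, k)
    labels = [nearest(r, global_modes) for r in records]
    return global_modes, labels


# Toy categorical records (made up for illustration), two interleaved groups.
records = [
    ("red", "small", "round"), ("blue", "large", "square"),
    ("red", "small", "oval"), ("blue", "large", "flat"),
    ("red", "tiny", "round"), ("blue", "huge", "square"),
] * 2

global_modes, labels = divide_and_conquer_k_modes(records, k=2)
print(global_modes)
print(labels)
```

Each subset is clustered without reference to the others, so every K-Modes call works on a smaller problem. The final merge operates only on the handful of subset modes, not on the full data set.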
Keywords/Search Tags: Cluster analysis, Divide-and-conquer method, Categorical data, Dissimilarity measure, Evaluation index