
Similarity Measures In Cluster Analysis And Its Applications

Posted on: 2013-02-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Bai    Full Text: PDF
GTID: 1228330395967908    Subject: Computer application technology
Abstract/Summary:
In machine learning research, clustering is an unsupervised learning task that draws attention from statistics, computer science, and related fields. It is not only an important part of data mining but also a basic problem in pattern recognition. In cluster analysis, data elements are grouped according to a similarity measure: the goal of clustering is to maximize the similarity among elements in the same cluster while maximizing the dissimilarity between different clusters. Furthermore, because clustering is unsupervised, the validation of clustering results is itself a popular research topic. We identify three types of similarity-measure problems in cluster analysis: the similarity between two data elements, the similarity between two clusters, and the similarity between two clustering results. Consequently, when image processing is approached through cluster analysis, the choice of similarity measure is a critical problem.

In this thesis, we first briefly introduce basic definitions in cluster analysis, the overall clustering procedure, the classification of clustering methods, the problem of similarity measurement, and the applications of clustering in image processing. Then, based on classic definitions from information theory, we discuss the three types of similarity measures from an information-theoretic viewpoint. Furthermore, we verify the reasonableness and effectiveness of the proposed methods on several image processing tasks: image clustering, contour grouping, refinement of over-segmented results, and segmentation evaluation. The contributions of this thesis are as follows:

First, similarity between two data elements (case 1): handling complex data with Bregman divergence. To measure the similarity between two images, we must answer two questions: how to represent the image data, and how to evaluate the similarity between two image objects based on that representation. Under the information bottleneck principle, we propose to combine the Bag of Words model with the Bregman divergence method for content-based image clustering that carries more semantic information. This approach has three features: images are represented by the Bag of Words model, which captures more image content by utilizing various feature detectors and provides a histogram representation based on image features; following the information bottleneck method, the goal is to find clusters that minimize the loss of mutual information between images and their extracted features; and the Bregman divergence algorithm is applied to minimize this loss of mutual information. The image clustering process driven by this information-theoretic objective works like the k-means algorithm, with the KL divergence in the Bregman method playing the role that the Euclidean distance plays in k-means.

Second, similarity between two data elements (case 2): improving the robustness of the clustering procedure through multi-feature similarity among multiple data elements. For contour grouping via clustering, within the information-based clustering framework, we propose to compute a collective similarity from multi-feature grouping cues rather than pairwise similarity; we call this measure multi-feature similarity. We then take the collective similarity values as input and apply the information-based clustering method to group the extracted edge features. Experimental results show that, under the same noise conditions, grouping quality is clearly improved by using multi-feature similarity instead of pairwise similarity.

Third, similarity between two clusters: we propose to apply the information potential and Renyi's cross entropy, both defined in information theoretic learning, to measure the similarity between two clusters.
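In information theoretic learning, the cross information potential between two clusters has a simple Parzen-window estimator, and Renyi's quadratic cross entropy is its negative logarithm. The following is a minimal sketch of those standard quantities only (the kernel width `sigma` and the function names are illustrative choices, not values or code from the thesis):

```python
import numpy as np

def cross_information_potential(A, B, sigma=1.0):
    """Cross information potential between two clusters of d-dimensional points.

    V(A, B) averages a Gaussian kernel over all cross-cluster point pairs;
    a larger value means the two clusters overlap more.
    """
    diff = A[:, None, :] - B[None, :, :]              # shape (|A|, |B|, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    d = A.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)    # Gaussian normalizer
    return np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm)

def renyi_cross_entropy(A, B, sigma=1.0):
    """Renyi's quadratic cross entropy: -log of the cross information potential.

    A smaller value indicates more similar clusters, so an agglomerative
    procedure can repeatedly merge the pair of clusters minimizing it.
    """
    return -np.log(cross_information_potential(A, B, sigma))
```

Under this measure, an agglomerative pass over a fine-grained initial clustering would evaluate `renyi_cross_entropy` for every pair of current clusters and merge the minimizing pair until a stopping criterion is reached.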
For segmentation algorithms that tend to produce over-segmented results, we use the cross entropy between two clusters to build a hierarchical clustering structure by an agglomerative procedure, starting from an initial clustering result with fine granularity. Experimental results on three typical artificial datasets show that the similarity measure based on Renyi's cross entropy produces better clustering results than three popular measures: single linkage, complete linkage, and average linkage. Furthermore, we test the performance of the proposed similarity measure on several color images.

Fourth, similarity between two clustering results: extending the traditional normalized mutual information to the case where a machine clustering result is compared with multiple ground truths. In image segmentation evaluation, each image in a hand-labeled dataset typically has a set of manual segmentations, since different human subjects produce different segmentations at various granularity levels. To incorporate all of the manual segmentation information, similarity measures between two clusterings must be extended to handle multiple ground-truth images. We therefore propose an information-theoretic measure, the Normalized Joint Mutual Information, which extends the Normalized Mutual Information to this case. We demonstrate the reasonableness of the Normalized Joint Mutual Information for objective segmentation evaluation with multiple ground-truth segmentations by testing it on images from the Berkeley segmentation dataset.
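The quantity being extended is the classic normalized mutual information between two clusterings of the same elements, NMI(U, V) = 2 I(U; V) / (H(U) + H(V)). Below is a minimal sketch of that standard measure only; the joint extension to several ground truths is the thesis's own contribution and is not reproduced here:

```python
import numpy as np

def normalized_mutual_information(u, v):
    """Classic NMI between two label assignments of the same n elements.

    Returns 2 * I(U; V) / (H(U) + H(V)), a value in [0, 1] that is 1 for
    identical partitions (up to label permutation) and 0 for independent ones.
    """
    u = np.asarray(u)
    v = np.asarray(v)
    n = u.size
    # Contingency table -> joint distribution p(u, v).
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    joint = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(joint, (ui, vi), 1.0)
    joint /= n
    pu = joint.sum(axis=1)                 # marginal of U (no zero entries)
    pv = joint.sum(axis=0)                 # marginal of V (no zero entries)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz]))
    hu = -np.sum(pu * np.log(pu))
    hv = -np.sum(pv * np.log(pv))
    if hu + hv == 0.0:                     # both partitions are trivial
        return 1.0
    return 2.0 * mi / (hu + hv)
```

For segmentation evaluation, `u` would hold the machine segmentation's per-pixel labels and `v` one manual segmentation's labels; handling a whole set of manual segmentations at once is exactly the case the proposed Normalized Joint Mutual Information addresses.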
Keywords/Search Tags:cluster analysis, similarity measure, information bottleneck, information-based clustering, information theoretic learning, joint mutual information, image processing