A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure

Posted on:2014-02-20

Degree:Ph.D

Type:Thesis

University:University of Cincinnati

Candidate:Rawashdeh, Mohammad Y

GTID:2458390008461197

Subject:Computer Science

Abstract/Summary:

Clustering seeks to partition a given set of points into a number of clusters such that points in the same cluster are similar to one another and dissimilar to points in other clusters. By virtue of this goal, relational data are a natural fit for clustering. The similarity and dissimilarity relations between the data points are supposed to be the nuts and bolts of cluster formation; thus, the task is driven by the notion of similarity between the data points. In practice, similarity is usually measured by the pairwise distances between the data points. Indeed, the objective functions of two widely used clustering algorithms, namely k-means and fuzzy c-means, are expressed in terms of the pairwise distances between the data points.

The clustering task is complicated by the choice of distance measure and by the need to estimate the number of clusters. Fuzzy c-means is convenient when there is uncertainty in allocating points in overlapping areas to clusters. The k-means algorithm allocates points unequivocally to clusters, overlooking the similarities between points in overlapping areas. The fuzzy approach allows a point to be a member of as many clusters as necessary; thus it provides better insight into the relations between the points in overlapping areas.

In this thesis we develop a relational framework inspired by the silhouette measure of clustering quality. The framework asserts the relations between the data points by means of logical reasoning with the cluster membership values. The original description of computing silhouettes is limited to crisp partitions. A natural generalization of silhouettes to fuzzy partitions is given within our framework. Moreover, two notions of silhouette emerge within the framework at different levels of granularity, namely the point-wise silhouette and the center-wise silhouette. With this generalization, each silhouette is capable of measuring the extent to which a crisp or fuzzy partition has fulfilled the clustering goal at the level of the individual points or the cluster centers. The partitions are evaluated by the silhouette measure in conjunction with point-to-point or center-to-point distances.

With this generalization, the average silhouette value becomes a reasonable device for selecting between crisp and fuzzy partitions of the same data set. Accordingly, one can find out which partition better represents the relations between the data points, in accordance with their pairwise distances. This powerful feature of the generalized silhouettes has exposed a problem with the partitions generated by fuzzy c-means. We have observed that defuzzifying the fuzzy c-means partitions always improves the overall representation of the relations between the data points. This is due to the inconsistency between some of the membership values and the distances between the data points. This inconsistency has been reported by others on a couple of occasions in real-life applications.

Finally, we present an experiment that demonstrates a successful application of the generalized silhouette measure to feature selection for highly imbalanced classification. A significant reduction in the number of features resulted in a significant improvement in classification for a real data set.
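
For readers unfamiliar with the measure, the following is a minimal sketch of the classic crisp silhouette that the abstract takes as its starting point, together with a simple defuzzification step. It is not the generalized silhouette developed in the thesis, whose definition is not given in this abstract. It assumes Euclidean distances, the usual form s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its own cluster and b(i) is the smallest mean distance from point i to the members of any other cluster, and defuzzification by largest membership value. The function names, the toy points, and the membership matrix are hypothetical and chosen only for illustration.

```python
import math


def silhouette_values(points, labels):
    """Classic crisp silhouette s(i) = (b - a) / max(a, b) for each point."""
    n = len(points)
    clusters = sorted(set(labels))
    values = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster
        # (point i itself is excluded from its own cluster).
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist or len(mean_dist) < 2:
            # Singleton cluster or one-cluster partition: use s(i) = 0.
            values.append(0.0)
            continue
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def defuzzify(memberships):
    """Assign each point to the cluster with its largest membership value."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Hypothetical toy data: two well-separated pairs of points and an
# assumed fuzzy membership matrix (one row per point, one column per cluster).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
memberships = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]

labels = defuzzify(memberships)
scores = silhouette_values(points, labels)
average_silhouette = sum(scores) / len(scores)
print(labels, round(average_silhouette, 3))
```

Values near 1 indicate points that sit much closer to their own cluster than to any other, values near 0 indicate points in overlapping areas, and negative values suggest a point may be better placed in a neighboring cluster. The average over all points is the quantity the abstract refers to as the average silhouette value.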

Keywords/Search Tags:

Clustering, Points, Silhouette, Framework, Generalization, Fuzzy c-means, Relational

Related items

1	Research Of New Fuzzy Clustering Algorithms Based On Objective Function And Its Applications
2	Research On Hierarchical Clustering Algorithm Based On Silhouette
3	The Application Of Fuzzy C-means Clustering In The Stock Investment
4	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel
5	Research Of Key Techniques In Fuzzy Clustering Based On Objective Function
6	Improved Fuzzy C Means Clustering Algorithm And Its Application
7	Studies On New Fuzzy Clustering Algorithms And Clustering Validity Problems
8	Research On K-means Clustering Algorithm Based On Differential Privacy Protection
9	Study Of Auto-Adaption Fuzzy C-Means Clustering Algorithm
10	Probabilistic K-means Models Via Nonlinear Programming