Data clustering with pairwise constraints

Posted on:2015-12-09

Degree:Ph.D

Type:Thesis

University:Michigan State University

Candidate:Yi, Jinfeng

GTID:2478390017989688

Subject:Computer Science

Abstract/Summary:

Classical unsupervised clustering is an ill-posed problem due to the absence of a unique clustering criterion. This issue can be addressed by introducing additional supervised information, usually cast in the form of pairwise constraints, into the clustering procedure. Depending on their source, most pairwise constraints fall into two categories: (i) pairwise constraints collected from a set of non-expert crowd workers, which leads to the problem of crowdclustering, and (ii) pairwise constraints collected from an oracle or experts, which leads to the problem of semi-supervised clustering. In both cases, collecting pairwise constraints can be expensive, so it is important to identify the minimal number of pairwise constraints needed to accurately recover the underlying true data partition, also known as the sample complexity problem.

In this thesis, we first analyze the sample complexity of crowdclustering. We propose a novel crowdclustering approach based on the theory of matrix completion. Unlike the existing crowdclustering algorithm, which is based on a Bayesian generative model, the proposed approach needs far fewer crowdsourced pairwise annotations to accurately cluster all the objects. Our theoretical analysis shows that, in order to accurately cluster N objects, only O(N log^2 N) randomly sampled pairs need to be annotated by crowd workers. To further reduce the sample complexity, we then introduce a semi-crowdsourced clustering framework that effectively incorporates the low-level features of the objects to be clustered. In this framework, we only need to sample a subset of n << N objects and generate their pairwise constraints via crowdsourcing. After completing an n x n similarity matrix using the proposed crowdclustering algorithm, we can further recover an N x N similarity matrix by applying a regression-based distance metric learning algorithm to the completed smaller similarity matrix. This enables us to reliably cluster N objects with only O(n log^2 n) crowdsourced pairwise constraints.

Next, we study the problem of sample complexity in semi-supervised clustering. To this end, we propose a novel convex semi-supervised clustering approach based on the theory of matrix completion. In order to reduce the number of pairwise constraints needed to achieve a perfect data partitioning, we apply the natural assumption that the feature representations of the objects reflect the similarities between objects. This enables us to use only O(log N) pairwise constraints to perfectly recover the data partition of N objects.

Lastly, in addition to sample complexity, which relates to labeling costs, we also consider the computational cost of semi-supervised clustering. Specifically, we study the problem of efficiently updating clustering results when the pairwise constraints are generated sequentially, a common case in various real-world applications such as social networks. To address this issue, we develop a dynamic semi-supervised clustering algorithm that casts the clustering problem as a search problem in a feasible convex space, i.e., a convex hull whose extreme points are an ensemble of multiple data partitions. Unlike classical semi-supervised clustering algorithms, which need to re-optimize their objective functions when new pairwise constraints are generated, the proposed method only needs to update a low-dimensional vector, and its time complexity is independent of the number of data points to be clustered. This enables us to update large-scale clustering results in an extremely efficient way.
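
The last point of the abstract, that a new pairwise constraint only changes a low-dimensional vector, can be illustrated with a small sketch. It is not the algorithm from the thesis: it assumes the vector is a set of simplex weights `gamma` over m precomputed partitions, and it uses an exponentiated-gradient step on a squared loss as a stand-in update rule. The names `coassociation_entry`, `update_weights`, and `eta`, as well as the toy partitions, are hypothetical.

```python
import numpy as np


def coassociation_entry(partitions, i, j):
    """For each of the m partitions, 1.0 if points i and j share a cluster, else 0.0."""
    return (partitions[:, i] == partitions[:, j]).astype(float)


def update_weights(gamma, partitions, i, j, must_link, eta=1.0):
    """One exponentiated-gradient step for a single new constraint on (i, j).

    gamma is a point on the simplex, i.e. a convex combination of the ensemble
    partitions. The step touches only m numbers, so its cost does not depend
    on the number of data points. (Assumed update rule, for illustration only.)
    """
    a = coassociation_entry(partitions, i, j)      # shape (m,)
    target = 1.0 if must_link else 0.0
    pred = float(gamma @ a)                        # combined similarity of i and j
    grad = 2.0 * (pred - target) * a               # gradient of (pred - target)^2
    new_gamma = gamma * np.exp(-eta * grad)        # multiplicative update
    return new_gamma / new_gamma.sum()             # renormalize onto the simplex


# Three hypothetical partitions of six points (one row per partition).
partitions = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
])
gamma = np.full(3, 1.0 / 3.0)                      # start from uniform weights

# Constraints arrive one at a time; each one triggers a cheap update.
gamma = update_weights(gamma, partitions, 0, 2, must_link=True)
gamma = update_weights(gamma, partitions, 0, 4, must_link=False)
best = int(np.argmax(gamma))                       # partition most consistent so far
```

In this toy run, the first partition is the only one that agrees with both constraints, so it ends up with the largest weight. A full method would still need a way to turn the weighted combination into a final partition, which the sketch leaves out.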

Keywords/Search Tags:

Clustering, Pairwise constraints, Data, Problem, Sample complexity, Objects

Related items

1	Semi-supervised Clustering Algorithm And Implementation Based On Seeds Set And Pairwise Constraints
2	Research On Semi-supervised Clustering Algorithms With Pairwise Constraints
3	Study Of Semi-supervised Fuzzy Clustering Algorithm Based On Pairwise Constraints
4	Learning from and actively selecting pairwise constraints in data science
5	On Feature Selection, Kernel Learning and Pairwise Constraints for Clustering Analysis
6	Graph-based Clustering For Multivariate Time Series Data Using Pairwise Constraint Propagation
7	Semi-Supervised Clustering Algorithm Based On Pairwise Constraints And Its Parallel Implementation
8	Research On Deep Clustering With Pairwise Constraints
9	Studies On Semi-Supervised Clustering Algorithms Based On Pairwise Constraints
10	Research On Active Learning Algorithms Of Pairwise Constraints In Semi-supervised Clustering