Font Size: a A A

Stochastic and iterative techniques for relational data clustering

Posted on:2010-09-13Degree:Ph.DType:Dissertation
University:University of Maryland, Baltimore CountyCandidate:Anthony, Adam PFull Text:PDF
GTID:1448390002477147Subject:Computer Science
Abstract/Summary:
This dissertation focuses on the topic of relational data clustering, which is the task of organizing objects into logical groups, or clusters, taking into account the relational links between objects. As a research area, relational clustering has received a great deal of attention recently, because of the large variety of social media applications and other modern relational data sources that have become popular, such as weblogs, protein interaction networks, social networks, and citation graphs. The contributions of this dissertation are in three areas: probabilistic clustering algorithms, iterative clustering algorithms, and multi-relational clustering algorithms.;The probabilistic clustering algorithms are presented as a general framework and allow for the highest level of expression in developing models for relational clustering. Specifically, I show that one particular part of the clustering model---the relation existence distribution---can be modified in a variety of ways. I give two example instantiations of the framework: one that uses objects' distance from their cluster mean to adjust link probabilities, and another that measures the correlation of paired covariate values to determine link probabilities. I demonstrate the properties of each model on both artificial data and two social networks. The results show the potential of the relational probability function in developing useful clustering models.;The iterative algorithm presented in this dissertation, BMOD, trades off model expressiveness for speed and can be applied to much larger data sets, scaling up to several thousand objects. I present empirical results showing that BMOD performs similarly to two related probabilistic models, while also being much faster. The model is applied to a variety of relational data sets including a protein interaction network, a citation graph, and a social network.;The results for the BMOD algorithm led to a multi-relational approach: M-BMOD, which generalizes the clustering model for multiple relation types. The M-BMOD algorithm is shown to perform poorly when low-quality data is present, so an approach for relation selection, or the process of identifying the most relevant relational information out of a larger set of different relation types, is presented. I show that the approach effectively identifies relations that mutually agree on an optimal clustering while also filtering out any relations that will mislead the algorithm.
Keywords/Search Tags:Clustering, Relational, Iterative
Related items