Font Size: a A A

Clustering in relational data and ontologies

Posted on:2011-08-24Degree:Ph.DType:Dissertation
University:University of Missouri - ColumbiaCandidate:Havens, Timothy CFull Text:PDF
GTID:1448390002468913Subject:Engineering
Abstract/Summary:
This dissertation studies the problem of clustering objects represented by relational data. This is a pertinent problem as many real-world data sets can only be represented by relational data for which object-based clustering algorithms are not designed. Relational data are encountered in many fields including biology, management, industrial engineering, and social sciences. Unlike numerical object data, which are represented by a set of feature values (e.g. height, weight, shoe size) of an object, relational object data are the numerical values of (dis)similarity between objects. For this reason, conventional cluster analysis methods such as k-means and fuzzy c-means cannot be used directly with relational data.;I focus on three main problems of cluster analysis of relational data: (i) tendency prior to clustering---how many clusters are there?; (ii) partitioning of objects---which objects belong to which cluster?; and (iii) validity of the resultant clusters ---are the partitions "good"? Analyses are included in this dissertation that prove that the Visual Assessment of cluster Tendency (VAT) algorithm has a direct relation to single-linkage hierarchical clustering and Dunn's cluster validity index. These analyses are important to the development of two novel clustering algorithms, CLODD-CLustering in Ordered Dissimilarity Data and ReSL-Rectangular Single-Linkage clustering.;Also presented in my analysis of VAT is a recursive formulation of the improved VAT (iVAT) algorithm. iVAT is shown to improve the visual evidence of cluster tendency on some types of data for which VAT fails. The computational complexity of my recursive formulation of iVAT is O (n2), as opposed to O( n3) of the original formulation---n is the number of objects considered.;CLODD is a clustering algorithm that works on reordered dissimilarity data. Typically, the dissimilarity matrix is reordered with the VAT algorithm; although, I present results on data that is reordered with other methods. CLODD produces partitions by minimizing a novel objective function that measures the fit of a candidate partition to the 'blockiness' of images of reordered dissimilarity data. Three characteristics are included in the objective function: 'blockiness' or constrast, 'edginess', and small-cluster rejection. CLODD is shown to extract good partitions of data when the reordering method is successful in producing "good" visualizations.;A special form of relational data is rectangular data. Rectangular relational data are dissimilarity values between a set of row objects and a set of column objects, where the relation among row objects (or column objects) is unknown. One example of this type of data set is ratings given by movie reviewers (the column objects) for a set of movies (the row objects). Each rating is interpreted as a dissimilarity value between a reviewer and a movie---high rating indicates low dissimilarity, while a low rating indicates high dissimilarity. Additionally the relation between reviewers (or between movies) is unknown.;I present a new formulation of the co-VAT algorithm, which a visual method for determining the cluster tendency of rectangular data. I provide several examples for which my new formulation is successful in showing the preferred cluster tendency and for which the original co-VAT fails. I also extend the notions in the iVAT algorithm to the co-VAT algorithm, which I call co-iVAT.;ReSL addresses a special form of relational data, rectangular data. ReSL extracts five types of clusters from rectangular relational data. I present several examples that show the behavior of ReSL in the presence of different types of rectangular data. A comparison is made with spectral co-clustering and ReSL is shown to more effective at elucidating the clustering structure.;Last, this dissertation addresses clustering in ontologies, a specific type of data. These data are composed of collections of terms organized in a hierarchical taxonomy (a directed acyclic graph); examples include the Gene Ontology, the MeSH ontology, patient medical records, and web documents. I apply an extension to the Self-Organizing Map (SOM) to produce a new algorithm, the OSOM-Ontological Self-Organizing Map. OSOM provides visualization and linguistic summarization of ontology-based data. A binary-valued vector-based network prototype is used to represent ontological objects (e.g. genes and gene products).;These algorithms and analyses are pertinent to all problems that produce relational data; however, I specifically examine these methods, especially the OSOM, for use on bioinformatics-based data. I show results of these algorithms with data composed of Gene Ontology annotations and microarray gene expression data, and with data available from the UCI Machine Learning Repository.
Keywords/Search Tags:Data, Clustering, Objects, VAT, Gene, Dissimilarity, Algorithm
Related items