Exploring similarities in high-dimensional datasets

Posted on: 2006-05-22
Degree: Ph.D.
Type: Dissertation
University: Rensselaer Polytechnic Institute
Candidate: Sequeira, Karlton
Full Text: PDF
GTID: 1458390005492494
Subject: Computer Science
Abstract/Summary:
Data are often collected by a number of sources that cannot share their entire datasets, for reasons such as confidentiality agreements or dataset size. However, these sources may be willing to share a condensed representation of their datasets. If some subset of these condensed representations, drawn from different sources, is found to be unusually similar, policies successfully applied to one dataset may be considered for application to the others.

In this dissertation, we tackle the problem of finding similarities across high-dimensional datasets. We propose a framework in which condensed representations of the datasets are used to obfuscate details and limit noise. We provide algorithms to find interesting regions within datasets; these regions become the components of the condensed representations. We propose similarity measures for these components. We then use a graph-matching-based formulation to find structurally similar components across the condensed representations of the datasets. Unlike some earlier algorithms, ours can match a subgraph in one graph to a subgraph in another.

We test our algorithms on a wide array of synthetic and real datasets and make a number of discoveries. We find that structure-based similarity helps amplify weaker patterns. It also allows patterns to be discovered by integrating datasets that may have differing schemas.
Keywords/Search Tags: Datasets, Condensed representations