Exploring similarities in high-dimensional datasets

Posted on: 2006-05-22

Degree: Ph.D.

Type: Dissertation

University: Rensselaer Polytechnic Institute

Candidate: Sequeira, Karlton

GTID: 1458390005492494

Subject: Computer Science

Abstract/Summary:

Data are often collected by multiple sources that cannot share their entire datasets, for reasons such as confidentiality agreements or dataset size. These sources may, however, be willing to share a condensed representation of their datasets. If some subset of the condensed representations from different sources is found to be unusually similar, policies applied successfully to one dataset may be considered for application to the others.

In this dissertation, we tackle the problem of finding similarities across high-dimensional datasets. We propose a framework in which condensed representations of the datasets are used to obfuscate details and limit noise. We provide algorithms to find interesting regions within datasets; these regions become components of the condensed representations. We propose similarity measures for these components. We then use a graph-matching-based formulation to find structurally similar components across the condensed representations of the datasets. Unlike some earlier algorithms, ours can match a subgraph in one graph to a subgraph in another.

We test our algorithms on a wide array of synthetic and real datasets and make a number of discoveries. We find that structure-based similarity enhances the amplification of weaker patterns. It also allows patterns to be discovered by integrating datasets that may have differing schemas.
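The graph-matching idea in the abstract can be illustrated with a toy sketch. This is not the dissertation's algorithm: the abstract does not state its similarity measures or matching procedure, so everything below is an assumption made for illustration. Each condensed representation is modeled as a small weighted graph whose nodes are interesting regions (the hypothetical node weight is the fraction of points in the region) and whose edge weights, assumed to lie in [0, 1], describe how two regions relate. The hypothetical functions `match_score` and `best_match` search exhaustively for the mapping from the smaller graph onto a subgraph of the larger one that best preserves node and edge weights.

```python
from itertools import permutations


def match_score(a_nodes, a_edges, b_nodes, b_edges, mapping):
    """Score a candidate mapping {node in A: node in B}; higher is more similar.

    Assumed (illustrative) similarity: 1 - |weight difference| for every
    matched node pair and for every pair of matched edges.
    """
    score = 0.0
    for a, b in mapping.items():
        score += 1.0 - abs(a_nodes[a] - b_nodes[b])
    pairs = list(mapping.items())
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (a1, b1), (a2, b2) = pairs[i], pairs[j]
            # A missing edge is treated as weight 0.0.
            wa = a_edges.get(frozenset((a1, a2)), 0.0)
            wb = b_edges.get(frozenset((b1, b2)), 0.0)
            score += 1.0 - abs(wa - wb)
    return score


def best_match(a_nodes, a_edges, b_nodes, b_edges):
    """Brute-force search; assumes A has no more nodes than B.

    Only feasible for very small graphs, since it tries every injective
    mapping of A's nodes into B's nodes.
    """
    a_keys = sorted(a_nodes)
    best_score, best_mapping = float("-inf"), None
    for perm in permutations(sorted(b_nodes), len(a_keys)):
        candidate = dict(zip(a_keys, perm))
        s = match_score(a_nodes, a_edges, b_nodes, b_edges, candidate)
        if s > best_score:
            best_score, best_mapping = s, candidate
    return best_mapping, best_score


# Made-up condensed representations of two datasets.
a_nodes = {"a1": 0.5, "a2": 0.3}
a_edges = {frozenset(("a1", "a2")): 0.2}
b_nodes = {"b1": 0.3, "b2": 0.5, "b3": 0.1}
b_edges = {
    frozenset(("b1", "b2")): 0.2,
    frozenset(("b1", "b3")): 0.9,
}

mapping, score = best_match(a_nodes, a_edges, b_nodes, b_edges)
print(mapping, score)
```

In this made-up example the two regions of the first dataset line up with regions `b2` and `b1` of the second, leaving `b3` unmatched, which is the sense in which one graph is matched to a subgraph of another. Exhaustive search grows factorially with the number of regions, so a practical system would need a more scalable matching formulation.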

Keywords/Search Tags:

Datasets, Condensed representations

Related items

1	Research On The Storage Of Condensed Cube Based On Flash Memory
2	Mining Condensed Sets Of Sequential Patterns And Structured Patterns
3	One-shot Voice Conversion Algorithm Design And Implementation Based On Representations Separation
4	Research On The Efficient Materialization And Fast Query Of Condensed Data Cube
5	Analysis of precision agriculture datasets for on-farm research
6	Joint time-frequency representations of nonstationary signals
7	Discovering and ranking outliers in very large datasets
8	Incorporating indexicality and contingency into the design of representations for computer-mediated collaboration
9	Fractal Analysis Of Datasets Using Distributed Computing
10	Non-stationary analysis on datasets and applications