Data reduction in data warehouses

Posted on:2004-05-11

Degree:Ph.D

Type:Dissertation

University:University of Toronto (Canada)

Candidate:Palpanas, Themistoklis

Full Text:PDF

GTID:1458390011955909

Subject:Computer Science

Abstract/Summary:

Much research has been devoted to the efficient computation of relational aggregations. In this paper we consider the inverse problem, that of deriving (approximately) the original data from the aggregates.; We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data ruining purposes.; We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions.; Subsequently, we consider the above problem in a space constrained environment, where only a subset of the required aggregates can be stored. More specifically, we wish to select a set of aggregates of interest, subject to a constraint on the total space occupied by these aggregates. The objective is to maximize the total benefit, which is a function of the number and importance of the queries that we can estimate given the selected aggregates.; For this problem, which as we show is NP-hard, there are no known polynomial approximation schemes, and we propose several algorithms for solving it. We explore the use of greedy and randomized techniques as well as clustering based approaches. The solutions presented herein are generic and can be applied to other problem domains as well.; We explore the properties and special characteristics of the above techniques with an experimental evaluation. Our results illustrate the behavior of the algorithms under different settings, and highlight the benefits of each approach. Based on our analysis, we present worst case scenarios for the algorithms. This offers insight into the operation of the algorithms, and provides a practical guide for selecting among the proposed techniques.

Keywords/Search Tags:

Data, Problem, Algorithms

Related items

1	Research On Some Models And Algorithms Of Network Optimization Problems In Uncertain Condition
2	Two Improved Cultural Algorithms And Their Applications
3	Research On Recommender Algorithms For Data Sparsity Problem
4	Theory and algorithms for modern problems in machine learning and an analysis of markets
5	Shandong Institute Of Commerce And Technology Course Scheduling System
6	Algorithms for managing data in distributed systems
7	Algorithms for NP-hard Optimization Problems and Cluster Analysi
8	A Study Of Algorithms For Job Shop Scheduling Problem
9	Pattern discovery in trees: Algorithms and applications to document and scientific data management
10	From The Discrete Geodesic Problem To The Dynamic Ordered Set Problem