Font Size: a A A

Data reduction in data warehouses

Posted on:2004-05-11Degree:Ph.DType:Dissertation
University:University of Toronto (Canada)Candidate:Palpanas, ThemistoklisFull Text:PDF
GTID:1458390011955909Subject:Computer Science
Abstract/Summary:
Much research has been devoted to the efficient computation of relational aggregations. In this paper we consider the inverse problem, that of deriving (approximately) the original data from the aggregates.; We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data ruining purposes.; We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions.; Subsequently, we consider the above problem in a space constrained environment, where only a subset of the required aggregates can be stored. More specifically, we wish to select a set of aggregates of interest, subject to a constraint on the total space occupied by these aggregates. The objective is to maximize the total benefit, which is a function of the number and importance of the queries that we can estimate given the selected aggregates.; For this problem, which as we show is NP-hard, there are no known polynomial approximation schemes, and we propose several algorithms for solving it. We explore the use of greedy and randomized techniques as well as clustering based approaches. The solutions presented herein are generic and can be applied to other problem domains as well.; We explore the properties and special characteristics of the above techniques with an experimental evaluation. Our results illustrate the behavior of the algorithms under different settings, and highlight the benefits of each approach. Based on our analysis, we present worst case scenarios for the algorithms. This offers insight into the operation of the algorithms, and provides a practical guide for selecting among the proposed techniques.
Keywords/Search Tags:Data, Problem, Algorithms
Related items