Font Size: a A A

Building compressed database systems

Posted on:2003-03-22Degree:Ph.DType:Thesis
University:Cornell UniversityCandidate:Chen, ZhiyuanFull Text:PDF
GTID:2468390011989735Subject:Computer Science
Abstract/Summary:
Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, motivating the use of data compression techniques in database systems to trade reduced disk I/O against additional CPU overhead for compression and decompression of data. In this thesis, we study how to build compressed database systems that achieve better performance than conventional database systems. We address the following four issues.; First, we study how to compress string data stored in database systems. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects the most effective compression method for string-valued attributes, and achieves significant performance improvement over standard compression techniques such as Lempel-Ziv.; Second, we examine how to run queries efficiently on compressed data. None of the previous work suitably addresses the role of the query optimizer: during query execution, data are either eagerly decompressed when they are read into main memory, or data lazily stay compressed in main memory and are decompressed only on demand. We introduce the compression-aware query optimization problem and propose two algorithms: a provably optimal algorithm and a fast algorithm that almost always generates optimal execution strategies. Experiments using TPC-H data demonstrate that our approach results in up to an order speed-up over existing approaches.; Third, we study the problem of automatically selecting compression methods based on a query workload, which is crucial for the performance and usability of compressed database systems. We examine the complexity of the problem and propose a solution that conducts a cost-based search on possible compression configurations. Our experiments using both real and TPC-H data demonstrate the effectiveness of our approach.; Finally, we examine how to compress results shipped to clients that have limited bandwidth connections or severe storage restrictions. We choose a combination of compression methods that use statistical and semantic information of the query results to enhance the effect of compression.
Keywords/Search Tags:Database systems, Compression, Query
Related items