
Optimizing Query Processing In Distributed In-Memory Databases

Posted on: 2016-02-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Wang
GTID: 1228330461474107
Subject: Computer application technology
Abstract/Summary:
Looking back over the past forty years of database evolution, it is clear that the development of database systems has always been driven by the growing demands of data storage and manipulation in real applications, and that database architectures have always been shaped by the computer hardware available at the time. Nowadays, a growing number of new applications, such as smart grids and smart traffic, collect huge amounts of data every second and turn to database systems for real-time data analysis. Traditional database systems unfortunately fail to provide efficient analysis over such large volumes of data, largely because of expensive disk I/O. Thanks to advances in hardware manufacturing, the price of commodity servers equipped with large RAM capacities and powerful multiprocessors has dropped dramatically, making it technically and economically feasible for database systems to load the entire data set into memory and expect dramatic performance improvements. As a result, many in-memory databases, such as MonetDB, Vectorwise, SAP HANA, and HyPer, have emerged in the last decade, and in-memory data processing has become a hot topic in the research community. Most existing work, however, focuses on centralized in-memory databases, whose scalability is inherently limited by the memory capacity and the number of CPU cores within a single server.

Distributed in-memory databases, in contrast, are an attractive solution for real-time analysis over large data sets, thanks to the abundant memory and computation resources of a cluster. However, because the hardware environment of a distributed in-memory database differs considerably from that of a traditional database, traditional database techniques fail to fully exploit the hardware potential of the cluster. This dissertation focuses on query processing techniques in distributed in-memory databases; our goal is to accelerate query processing by fully utilizing the hardware resources in the cluster. The major contributions of this dissertation are as follows.

1. We conduct a theoretical study of query performance in distributed in-memory databases. The query cost model derived from this study shows that the performance bottleneck of query processing in distributed in-memory databases can be highly dynamic and vary greatly from query to query. Specifically, for a given query, performance may hinge on the network communication among nodes, on the in-memory data processing within a single node, or on workload imbalance among the query nodes, depending on numerous factors such as the query workload, data distribution, hardware features, and algorithm implementation (an illustrative cost-model sketch is given below). This finding motivates us to optimize query processing in our system in the following three ways.
2. We minimize the network communication cost of query processing, in order to bridge the gap between efficient in-memory data processing and the relatively slow network bandwidth in the system. For queries that require extensive network communication, performance is bounded by the limited network transmission capacity. To tackle this problem, we first propose an NMDT algorithm, which converts a given query into a query execution plan with minimized network communication. We then propose a novel data exchange implementation that provides efficient, scalable, and skew-resilient network data transmission among the query nodes.

3. We analyze the factors limiting the performance of in-memory data processing on modern hardware and propose a NUMA-aware, scalable, and efficient aggregation algorithm. In a case study on in-memory aggregation, we show that cache capacity misses, cache coherence misses, and locking costs are the key factors limiting the performance of in-memory aggregation algorithms. We therefore propose a NUMA-aware radix partitioning method that divides the huge input relation into subsets to improve cache locality, without incurring expensive remote memory accesses across NUMA sockets. We also present a new, efficient aggregation algorithm that aggregates the partitioned data in parallel with low cache coherence and locking costs (a simplified sketch of this partition-then-aggregate pattern is given below).

4. We propose an elastic pipelining framework to tackle the problem of computation assignment in the cluster. It is performance-critical for the system to provision the right amount of computation resources to a query execution, in order to minimize query response time and make good use of the hardware. Such optimal computation allocation is impossible under the traditional iterator model, due to the unpredictable and fluctuating workload on each node and the static degree of parallelism inherent in that model. To solve this problem, we propose elastic pipelining, which consists of an elastic iterator model and a dynamic scheduler. The elastic iterator model extends the classic iterator model with dynamic multi-core execution capabilities, and the dynamic scheduler provisions CPU cores to query execution segments based on instantaneous measurements (a toy scheduling sketch is given below).

In summary, this dissertation systematically analyzes the performance bottlenecks of query processing in distributed in-memory databases and exploits optimization opportunities at the level of system architecture, algorithms, and implementation. Extensive experiments on both real and synthetic data sets demonstrate the effectiveness and efficiency of our proposals.
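The shape of the cost model in contribution 1 can be illustrated with a small sketch. The structure below, including the field names and the max-of-terms formulation, is an assumption made for exposition only, not the dissertation's actual model; it merely shows how network transfer, local processing, and skew can each become the dominant term.

#include <algorithm>
#include <cstdio>

// Hypothetical per-query measurements; all names here are illustrative.
struct QueryProfile {
    double bytes_shuffled;   // data exchanged over the network
    double avg_node_load;    // average per-node work (tuples)
    double max_node_load;    // work on the most heavily loaded node (tuples)
};

// Hypothetical per-node hardware characteristics.
struct ClusterProfile {
    double net_bandwidth;    // bytes per second available to each node
    double cpu_throughput;   // tuples per second a node can process
};

// Response time is governed by whichever term dominates: network transfer,
// local in-memory processing, or the extra work on the skewed node.
double estimate_response_time(const QueryProfile& q, const ClusterProfile& c) {
    double network_cost    = q.bytes_shuffled / c.net_bandwidth;
    double processing_cost = q.avg_node_load  / c.cpu_throughput;
    double skew_penalty    = (q.max_node_load - q.avg_node_load) / c.cpu_throughput;
    return std::max({network_cost, processing_cost + skew_penalty});
}

int main() {
    QueryProfile   q{2e9, 5e7, 9e7};   // 2 GB shuffled, noticeably skewed load
    ClusterProfile c{1.25e8, 2e7};     // ~1 Gbit/s per node, 20M tuples/s
    std::printf("estimated response time: %.2f s\n", estimate_response_time(q, c));
    return 0;
}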
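The partition-then-aggregate pattern in contribution 3 can likewise be sketched. The code below is a minimal, single-machine illustration under assumed names and a SUM aggregate: each thread radix-partitions its own chunk by the low bits of the group key, and each thread then aggregates a disjoint set of partitions into a private hash table. NUMA pinning, cache-sized fan-out tuning, and the actual algorithm's details are omitted.

#include <cstdint>
#include <cstdio>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr int      kRadixBits = 4;                  // 2^4 = 16 partitions
constexpr unsigned kFanout    = 1u << kRadixBits;

// A (group key, value) pair; the schema is a stand-in for a real relation.
using Row         = std::pair<std::uint64_t, std::int64_t>;
using Partitioned = std::vector<std::vector<Row>>;

// Phase 1: each thread scatters its own chunk of the input into kFanout
// partitions keyed by the low bits of the group key.
Partitioned partition_chunk(const std::vector<Row>& chunk) {
    Partitioned parts(kFanout);
    for (const Row& r : chunk)
        parts[r.first & (kFanout - 1)].push_back(r);
    return parts;
}

int main() {
    // Two input chunks, as if scanned by two worker threads.
    std::vector<std::vector<Row>> chunks = {
        {{1, 10}, {2, 20}, {17, 5}},
        {{2, 7}, {33, 1}, {1, 3}}};

    std::vector<Partitioned> local(chunks.size());
    for (std::size_t i = 0; i < chunks.size(); ++i)
        local[i] = partition_chunk(chunks[i]);

    // Phase 2: thread t exclusively owns partition ids congruent to t, so it
    // aggregates into a private hash table with no locks and no shared state.
    const unsigned n_threads = 2;
    std::vector<std::unordered_map<std::uint64_t, std::int64_t>> results(n_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (unsigned p = t; p < kFanout; p += n_threads)
                for (const Partitioned& thread_parts : local)
                    for (const Row& r : thread_parts[p])
                        results[t][r.first] += r.second;   // SUM aggregate
        });
    }
    for (std::thread& w : workers) w.join();

    for (unsigned t = 0; t < n_threads; ++t)
        for (const auto& kv : results[t])
            std::printf("key %llu -> %lld\n",
                        static_cast<unsigned long long>(kv.first),
                        static_cast<long long>(kv.second));
    return 0;
}

Because a group key always falls into the same partition and every partition has exactly one owning thread, no grouping key is ever updated by two threads, which is what removes the coherence and locking costs discussed above.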
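Finally, the dynamic scheduler of contribution 4 can be illustrated with a toy proportional policy. The sketch below is an assumption for exposition, not the dissertation's scheduler: it periodically re-divides a fixed core budget among pipeline segments according to their instantaneous backlog, instead of fixing the degree of parallelism at plan time.

#include <algorithm>
#include <cstdio>
#include <vector>

// One running pipeline segment; the fields are illustrative stand-ins for
// the instantaneous measurements a dynamic scheduler would sample.
struct Segment {
    long pending_tuples;   // backlog observed at this instant
    int  assigned_cores;   // decision made by the scheduler each round
};

// Re-divide a fixed core budget among segments in proportion to their
// current backlog; intended to be called periodically at run time.
void rebalance(std::vector<Segment>& segments, int total_cores) {
    long total_pending = 0;
    for (const Segment& s : segments) total_pending += s.pending_tuples;
    if (total_pending == 0) return;

    int remaining = total_cores;
    for (Segment& s : segments) {
        int share = static_cast<int>(
            static_cast<double>(s.pending_tuples) / total_pending * total_cores);
        s.assigned_cores = std::max(1, share);   // never starve a segment
        remaining -= s.assigned_cores;
    }
    // Hand any leftover cores to the most backlogged segment.
    if (remaining > 0) {
        auto busiest = std::max_element(
            segments.begin(), segments.end(),
            [](const Segment& a, const Segment& b) {
                return a.pending_tuples < b.pending_tuples;
            });
        busiest->assigned_cores += remaining;
    }
}

int main() {
    std::vector<Segment> segments = {{8000, 0}, {1000, 0}, {1000, 0}};
    rebalance(segments, 16);
    for (const Segment& s : segments)
        std::printf("pending=%ld cores=%d\n", s.pending_tuples, s.assigned_cores);
    return 0;
}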
Keywords/Search Tags: Distributed In-Memory Database, Query Processing, Query Optimization, Scheduling, NUMA Architecture