Parallel Processing And Optimization Of Database Query Under Distributed Architecture

Posted on:2022-10-23

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Wei

Full Text:PDF

GTID:1488306722971219

Subject:Software Engineering

Abstract/Summary:

PDF Full Text Request

In the era of mobile Internet,the data generated by all walks of life has become an important factor of production,which contains great value.However,in the face of the growing scale of data and the continuous influx of analysis requests,the database query under the stand-alone architecture cannot use the limited hardware resources to meet the data analysis needs of low latency and high throughput.To this end,the academia and industry began trying to deploy the database query on the horizontally scalable distributed architecture,thereby improving the performance of database query by using the abundant hardware resources in the distributed architecture.Considering the difference of hardware environment between stand-alone architecture and distributed architecture,the traditional query parallel processing method cannot make full use of the considerable hardware resources in the distributed architecture,leading to the low performance improvement.Therefore,how to make full use of the scalable hardware resources in the distributed architecture to improve the performance of database query is a worthwhile research topic.This dissertation studies the parallel processing technology of database query under distributed architecture,which mainly focuses on three aspects: scanning operator,single query and multi query,and focuses on the following three problems: Firstly,for the scan operator,when the target data is gathered on some nodes,multiple scan subtasks generated by the existing parallel scan strategy will also accumulate on these nodes,but the data replications on other nodes are ignored so that the parallel scan ability among multiple nodes in the distributed architecture can not be fully utilized.Secondly,for a single database query,the distributed shared memory architecture can abstract the memory resources on different nodes into a shared memory space,so that the query processing scheme under the traditional stand-alone architecture can run on the distributed shared memory.However,compared with the stand-alone architecture,the distributed shared memory architecture is still a loosely-coupled implementation that depends on network connection,which is easy to lead to a large amount of network communication overhead in the query processing process,thereby affecting the query processing performance.Finally,for multiple database queries,some similar query requests often have the same query operators,which is easy to build highly overlapping data structures,resulting in the waste of computing resources and memory resources in the distributed architecture,which affects the parallel processing performance of multiple queries.Based on the above three key issues,the main work and contributions of this dissertation are as follows:(1)The parallel scan scheme based on multiple replications in distributed shared storage:In the shared storage architecture,the data table will be divided into multiple data partitions which are stored on different nodes.Therefore,when a scan operator involves multiple data partitions,the operator can also be divided into multiple scan subtasks to access these data partitions in parallel.However,when some nodes have more data partitions,the scan subtasks will also gather on these nodes,so it can only realize parallel scan between limited nodes.To this end,this dissertation proposes a parallel scan scheme based on multiple replications.Its core idea is to make use of the data parallelism between replications,so that the scan subtask for a single partition is further divided into scan subtasks for multiple replications,so as to run on different replications of different nodes in parallel,and then make full use of the parallel scan ability between multiple nodes.On the basis of multi-replication parallel scan,this dissertation also proposes a scan partition strategy based on linear programming model,so that different nodes bear similar scan loads.In addition,this dissertation also designs a set of multi-threaded parallel scheduling strategy for each node to ensure the load balance between multiple threads in the node during parallel scan,and minimize the task switching overhead in a single thread.(2)The network sensitive query processing framework based on distributed shared memory architecture:The distributed shared memory architecture breaks the memory resource isolation between machine nodes and exposes the memory on different machine nodes to a unified address space,so that the query processing scheme under the single machine architecture can be quickly deployed to the distributed architecture.However,compared with single machine shared memory architecture,distributed shared memory architecture is still a loosely-coupled implementation with the help of network connection.Therefore,when the traditional query processing scheme is deployed on the distributed shared memory architecture,some common read and write operations may also frequently trigger cross-node memory accesses,resulting in expensive network communication overhead.Therefore,this paper proposes a network sensitive query processing framework.In order to avoid the network bandwidth bottleneck in the query processing process,this dissertation designs an interleaving scheduling strategy for the framework,so that the query pipeline with frequent network communication can interleave with other pipelines,so as to reduce the network bandwidth consumption in the query processing process.In addition,in order to minimize the number of cross-node memory accesses with high delay in query processing,this dissertation also designs a pipeline subtask allocation strategy based on weighted bipartite graph for the framework,so that the pipeline processing has high data locality.(3)The chunk-based hash table reusing scheme for multiple hash-join queries:In the data analysis scenario,frequently used hash-join queries often build similar hash table data structures,which wastes a lot of computing resources and memory resources.Therefore,this dissertation attempts to use the large memory characteristics of the distributed architecture to cache the hash table data structures constructed by previous hash-join queries,so that subsequent queries can reuse these data structures,so as to reduce the resource overhead during multi-query parallel processing and improve the parallel processing performance of multiple queries.Different from the existing cache reusing schemes,the hash table reusing scheme in this dissertation adopts the chunk-based management mechanism,so that each hash table in the cache is divided into multiple non-overlapping chunk-based hash tables.Therefore,subsequent queries can flexibly reuse those matching chunk-based hash tables according to their own predicate conditions,so as to improve the hash table reusing rate under complex predicate conditions.In addition,this dissertation also proposes an efficient scheduling strategy,which can flexibly select different scheduling methods from the perspective of query latency and throughput to better coordinate the conflict between chunk-based hash table reusing and cache update.To sum up,this dissertation deeply studies the parallel processing of database query under distributed architecture,further discusses the optimization points of parallel processing from the perspective of scanning operator,single query and multi query,and designs the specific optimization schemes.

Keywords/Search Tags:

Database Query, Scan Operator, Single-Query Processing, Multi-Query Processing, Distributed Architecture

PDF Full Text Request

Related items

1	Research On Key Technologies Of Distributed Rank-aware Query Processing
2	Optimizing Query Processing In Distributed In-Memory Databases
3	Research On Data Query Processing And Optimization In Distributed Database
4	Research On Query Processing Technologies Over Large Scale Knowledge Graphs
5	Query Processing And Optimization In Massive Multi-Database Integration
6	Research On Data Query Optimization Algorithm Of Distributed Database
7	Research On Distributed Query Processing And Optimization Of RDF Data
8	Research On Key Techniques Of Query Processing In Spatio-Temporal Databases
9	Research On Query Language And Query Processing In DM_OLAP
10	Research On The Query Processing Technologies Of Distributed Data Stream