Font Size: a A A

Parallel Processing And Optimization Of Database Query Under Distributed Architecture

Posted on:2022-10-23Degree:DoctorType:Dissertation
Country:ChinaCandidate:X WeiFull Text:PDF
GTID:1488306722971219Subject:Software Engineering
Abstract/Summary:PDF Full Text Request
In the era of mobile Internet,the data generated by all walks of life has become an important factor of production,which contains great value.However,in the face of the growing scale of data and the continuous influx of analysis requests,the database query under the stand-alone architecture cannot use the limited hardware resources to meet the data analysis needs of low latency and high throughput.To this end,the academia and industry began trying to deploy the database query on the horizontally scalable distributed architecture,thereby improving the performance of database query by using the abundant hardware resources in the distributed architecture.Considering the difference of hardware environment between stand-alone architecture and distributed architecture,the traditional query parallel processing method cannot make full use of the considerable hardware resources in the distributed architecture,leading to the low performance improvement.Therefore,how to make full use of the scalable hardware resources in the distributed architecture to improve the performance of database query is a worthwhile research topic.This dissertation studies the parallel processing technology of database query under distributed architecture,which mainly focuses on three aspects: scanning operator,single query and multi query,and focuses on the following three problems: Firstly,for the scan operator,when the target data is gathered on some nodes,multiple scan subtasks generated by the existing parallel scan strategy will also accumulate on these nodes,but the data replications on other nodes are ignored so that the parallel scan ability among multiple nodes in the distributed architecture can not be fully utilized.Secondly,for a single database query,the distributed shared memory architecture can abstract the memory resources on different nodes into a shared memory space,so that the query processing scheme under the traditional stand-alone architecture can run on the distributed shared memory.However,compared with the stand-alone architecture,the distributed shared memory architecture is still a loosely-coupled implementation that depends on network connection,which is easy to lead to a large amount of network communication overhead in the query processing process,thereby affecting the query processing performance.Finally,for multiple database queries,some similar query requests often have the same query operators,which is easy to build highly overlapping data structures,resulting in the waste of computing resources and memory resources in the distributed architecture,which affects the parallel processing performance of multiple queries.Based on the above three key issues,the main work and contributions of this dissertation are as follows:(1)The parallel scan scheme based on multiple replications in distributed shared storage:In the shared storage architecture,the data table will be divided into multiple data partitions which are stored on different nodes.Therefore,when a scan operator involves multiple data partitions,the operator can also be divided into multiple scan subtasks to access these data partitions in parallel.However,when some nodes have more data partitions,the scan subtasks will also gather on these nodes,so it can only realize parallel scan between limited nodes.To this end,this dissertation proposes a parallel scan scheme based on multiple replications.Its core idea is to make use of the data parallelism between replications,so that the scan subtask for a single partition is further divided into scan subtasks for multiple replications,so as to run on different replications of different nodes in parallel,and then make full use of the parallel scan ability between multiple nodes.On the basis of multi-replication parallel scan,this dissertation also proposes a scan partition strategy based on linear programming model,so that different nodes bear similar scan loads.In addition,this dissertation also designs a set of multi-threaded parallel scheduling strategy for each node to ensure the load balance between multiple threads in the node during parallel scan,and minimize the task switching overhead in a single thread.(2)The network sensitive query processing framework based on distributed shared memory architecture:The distributed shared memory architecture breaks the memory resource isolation between machine nodes and exposes the memory on different machine nodes to a unified address space,so that the query processing scheme under the single machine architecture can be quickly deployed to the distributed architecture.However,compared with single machine shared memory architecture,distributed shared memory architecture is still a loosely-coupled implementation with the help of network connection.Therefore,when the traditional query processing scheme is deployed on the distributed shared memory architecture,some common read and write operations may also frequently trigger cross-node memory accesses,resulting in expensive network communication overhead.Therefore,this paper proposes a network sensitive query processing framework.In order to avoid the network bandwidth bottleneck in the query processing process,this dissertation designs an interleaving scheduling strategy for the framework,so that the query pipeline with frequent network communication can interleave with other pipelines,so as to reduce the network bandwidth consumption in the query processing process.In addition,in order to minimize the number of cross-node memory accesses with high delay in query processing,this dissertation also designs a pipeline subtask allocation strategy based on weighted bipartite graph for the framework,so that the pipeline processing has high data locality.(3)The chunk-based hash table reusing scheme for multiple hash-join queries:In the data analysis scenario,frequently used hash-join queries often build similar hash table data structures,which wastes a lot of computing resources and memory resources.Therefore,this dissertation attempts to use the large memory characteristics of the distributed architecture to cache the hash table data structures constructed by previous hash-join queries,so that subsequent queries can reuse these data structures,so as to reduce the resource overhead during multi-query parallel processing and improve the parallel processing performance of multiple queries.Different from the existing cache reusing schemes,the hash table reusing scheme in this dissertation adopts the chunk-based management mechanism,so that each hash table in the cache is divided into multiple non-overlapping chunk-based hash tables.Therefore,subsequent queries can flexibly reuse those matching chunk-based hash tables according to their own predicate conditions,so as to improve the hash table reusing rate under complex predicate conditions.In addition,this dissertation also proposes an efficient scheduling strategy,which can flexibly select different scheduling methods from the perspective of query latency and throughput to better coordinate the conflict between chunk-based hash table reusing and cache update.To sum up,this dissertation deeply studies the parallel processing of database query under distributed architecture,further discusses the optimization points of parallel processing from the perspective of scanning operator,single query and multi query,and designs the specific optimization schemes.
Keywords/Search Tags:Database Query, Scan Operator, Single-Query Processing, Multi-Query Processing, Distributed Architecture
PDF Full Text Request
Related items