Design And Implementation Of Hash Join In Separated Computing And Storage Database

Posted on: 2022-02-19 | Degree: Master | Type: Thesis
Country: China | Candidate: K Zhang
GTID: 2518306572497374 | Subject: Computer technology
Abstract/Summary:
With the development of cloud technology, the demand for analyzing data on the cloud is constantly increasing. In analytical applications, the join query is the most complex and costly operation. It often involves large amounts of data, which brings heavy disk input and output (I/O) and, when computing and storage are separated, heavy network I/O as well, thereby reducing the execution efficiency of hash joins. It is therefore of great significance to provide an optimized hash join function for databases that separate computing and storage.

Starting from the optimization of the hash join at the computing layer, the optimization method CEHJ (Calculate Layer Hash Join) is proposed, which flexibly chooses between hash join and index-based join. This method adds a hash join execution function to the original executor, so that the hash join can partially replace the block nested loop join. At the same time, a cost estimation method for hash join execution is given: for an equi-join between two tables, the possibility of replacing the index nested loop join with a hash join is analyzed, so that the optimizer can automatically choose whether to process a given equi-join with a hash join or with an index nested loop join.

To reduce the network I/O of the tables participating in the hash join, query push-down hash join optimization (SQL Push Down Hash Join, SPHJ) is carried out in four parts: cost estimation, query push-down, parallel execution on storage nodes, and data management. The cost estimation part calculates, for a single table, the ratio of the total size of all columns participating in query execution to the size of the entire row, and uses it to determine whether single-table push-down optimization is needed for that table. The query push-down part traverses the execution plan given by the optimizer, constructs the relevant SQL statement based on the identified information and sends it to the corresponding storage node, removes the conditions that have been pushed down from the original execution plan, and receives the execution result from the storage node. The original storage node is adjusted to support selection and projection operations, and the query result is returned to the data management part of the computing layer. The data management part constructs a new interface through which the executor obtains data, and supports caching of that data.

To bring the execution of the hash join closer to the data, hash join push-down is added on top of the single-table query push-down optimization, so that the hash join is executed in parallel on the storage nodes. The selection, projection, and conditions related to the tables participating in the hash join are also executed by the storage nodes. The execution result is returned to the computing layer as an intermediate result, and operations such as aggregation and sorting are executed in the computing layer.

TPC-H is used to test the execution effect before and after optimization. The results show that, with the correctness of the query results preserved, the execution efficiency of the hash join is significantly improved, which in turn improves the execution efficiency of the entire query: on average, execution is more than 10 times faster than before optimization.
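The single-table push-down decision and SQL construction described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the `TableScan` structure, the function names, the 0.5 threshold, and the column sizes are all hypothetical, chosen only to show how a "used columns / full row" size ratio might drive the decision and how a selection-projection statement could be assembled for a storage node.

```python
from dataclasses import dataclass, field


@dataclass
class TableScan:
    """Hypothetical description of one table access in an execution plan."""
    table: str
    column_sizes: dict            # column name -> assumed average size in bytes
    used_columns: list            # columns the query actually needs
    predicates: list = field(default_factory=list)  # single-table conditions


def pushdown_ratio(scan: TableScan) -> float:
    """Size of the columns used by the query divided by the full row size."""
    row_size = sum(scan.column_sizes.values())
    used_size = sum(scan.column_sizes[c] for c in scan.used_columns)
    return used_size / row_size


def should_push_down(scan: TableScan, threshold: float = 0.5) -> bool:
    """Assumed rule: push down when only a small part of each row is needed."""
    return pushdown_ratio(scan) < threshold


def build_pushdown_sql(scan: TableScan) -> str:
    """Build the selection-projection statement sent to a storage node."""
    sql = f"SELECT {', '.join(scan.used_columns)} FROM {scan.table}"
    if scan.predicates:
        sql += " WHERE " + " AND ".join(scan.predicates)
    return sql


# Illustrative column sizes only; not taken from the thesis or from TPC-H.
scan = TableScan(
    table="lineitem",
    column_sizes={"l_orderkey": 4, "l_extendedprice": 8, "l_discount": 8,
                  "l_shipdate": 4, "l_comment": 44},
    used_columns=["l_orderkey", "l_extendedprice"],
    predicates=["l_shipdate >= '1995-01-01'"],
)

if should_push_down(scan):
    print(build_pushdown_sql(scan))
```

In this sketch the query touches 12 of 68 bytes per row, so the scan is pushed down and only the projected, filtered rows would cross the network; the conditions in `predicates` would then be removed from the computing-layer plan.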
Keywords/Search Tags: computing-storage separated database, hash join, query pushdown, parallel computing