
Unified Memory Management And System Performance Optimization For GPU-Based Large-Scale Analytical Query Processing

Posted on: 2024-04-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J L Wang
GTID: 1528307145495804    Subject: Software engineering
Abstract/Summary:
As data volumes grow, large-scale data analytics is becoming increasingly important to society's development. In recent years, the development of co-processors represented by GPUs (Graphics Processing Units) has created new possibilities for improving the data processing capability of a single machine. However, limited device memory and PCIe bandwidth are the main bottlenecks for GPU-based large-scale analytical query processing. NVIDIA provides UM (Unified Memory) on its parallel computing platform CUDA (Compute Unified Device Architecture); UM uses host memory to expand the GPU-accessible memory space and supports implicit CPU-GPU data migration. Although UM simplifies heterogeneous memory management, it performs poorly when dealing with large datasets. Most existing research efforts concentrate exclusively on either the upper layer, such as query plans and operators, or the lower layer, such as GPU driver optimizations; they do not address runtime UM-based data management in a way that considers the characteristics of both layers. Based on a self-adaptive, loosely coupled system design, this dissertation provides efficient UM management for large-scale data, taking both analytical query processing and UM behavior into account to optimize system performance. It makes the following three contributions.

(1) Boosting host-UM data transfer. Replacing physical device memory with UM expands the GPU-accessible memory space, but to remain compatible with the other modules of the original system, explicit data migration between pageable host memory and UM is still required. The host-UM transfer speed is affected by data block sizes, parallelism, and CPU-GPU heterogeneous resource competition. Based on an analysis of these factors, a data transfer acceleration module called D-Cubicle is designed and implemented. D-Cubicle groups data blocks by their original sizes, optimizes the transfer strategy for each group, and collects feedback on transfer performance in real time. It adjusts the number of transfer threads with a hybrid exponential and linear scheme, so it quickly adapts to changes in system resource competition and sustains a high effective transfer speed. In addition, a two-stage transfer strategy is proposed that uses the host-side UM space as the main buffer and reduces redundant PCIe traffic through on-demand migration. With D-Cubicle, and with data transfer time included, the optimized system achieves an average speedup of 1.43x and a maximum speedup of 2.09x.
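For concreteness, the sketch below illustrates the two-stage idea behind D-Cubicle using only standard CUDA unified memory APIs: pageable host data is first copied into a UM buffer by several CPU threads, and pages are later migrated to the GPU on demand with a prefetch. It is a minimal sketch under stated assumptions, not the dissertation's implementation; the helper name stage_into_um and the constant NUM_COPY_THREADS are illustrative, whereas D-Cubicle tunes the thread count adaptively from runtime feedback.

// Illustrative sketch only: stage 1 copies pageable host data into UM with
// several CPU threads; stage 2 prefetches only the needed range to the GPU.
// NUM_COPY_THREADS and stage_into_um are hypothetical names for this sketch.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>
#include <thread>
#include <vector>

constexpr int NUM_COPY_THREADS = 4;   // D-Cubicle adjusts this adaptively

// Stage 1: parallel memcpy from pageable host memory into a UM allocation.
void stage_into_um(char* um_buf, const char* pageable_src, size_t bytes) {
    std::vector<std::thread> workers;
    size_t chunk = (bytes + NUM_COPY_THREADS - 1) / NUM_COPY_THREADS;
    for (int t = 0; t < NUM_COPY_THREADS; ++t) {
        size_t off = static_cast<size_t>(t) * chunk;
        if (off >= bytes) break;
        size_t len = std::min(chunk, bytes - off);
        workers.emplace_back([=] { std::memcpy(um_buf + off, pageable_src + off, len); });
    }
    for (auto& w : workers) w.join();
}

int main() {
    const size_t bytes = 256ull << 20;        // 256 MiB example payload
    std::vector<char> pageable(bytes, 1);     // pageable host data

    char* um_buf = nullptr;
    cudaMallocManaged(&um_buf, bytes);        // host-side UM as the main buffer

    stage_into_um(um_buf, pageable.data(), bytes);

    // Stage 2: on-demand migration, moving only the part a query touches.
    int device = 0;
    cudaMemPrefetchAsync(um_buf, bytes / 4, device, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(um_buf);
    return 0;
}

Copying with multiple CPU threads reflects the abstract's observation that host-UM transfer speed depends on parallelism and on CPU-GPU resource competition, which is exactly what D-Cubicle's feedback-driven thread adjustment reacts to.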
(2) Optimizing heterogeneous data placement in UM buffers. In large-scale data analytics, besides accelerating the transfer phase, UM access performance during the computation phase must also be optimized. In traditional GPU-accelerated systems, the GPU processes data that resides entirely in physical device memory; with UM, data is dynamically distributed across host memory and device memory, and the bandwidths of these two memory spaces differ by an order of magnitude. Considering device memory usage in analytical queries and UM oversubscription, UBM (Unified memory Buffer Manager) is designed and implemented. UBM uses the UM slab as the basic management unit for heterogeneous memory resources, and a self-adaptive adjustment algorithm for heterogeneous data placement makes full use of physical device memory while reducing the expensive implicit data migration caused by device memory exhaustion. The algorithm uses a table of stepwise decreasing device memory ratios as its adjustment reference; the host-device data placement is changed dynamically for each query by moving UM slabs to the corresponding positions in the table. Moreover, UBM includes a dynamic data pre-eviction strategy that reserves free device memory for subsequent queries based on the actual physical device memory utilization. Across different query sequences, the optimized system achieves an average performance improvement of 32.5% to 157.3%.

(3) Optimizing UM management in a single-machine multi-GPU environment. In a single-machine multi-GPU environment, data management across the GPUs must be coordinated to avoid performance drops caused by one GPU's prolonged processing. Furthermore, the system should exploit the runtime information provided by multiple GPUs to build a more efficient self-adaptive UM management strategy and avoid excessive manual parameters. This work analyzes the performance bottlenecks in single-machine multi-GPU query processing, then designs and implements a single-machine multi-GPU UM management optimizer called UMates, which uses unified batch data pre-eviction to avoid severe blocking. UMates also deliberately differentiates the GPU-side data ratios of the major and auxiliary GPUs to probe the effect of different device memory ratios on actual device memory utilization, and uses this as reference information for self-adaptive data placement adjustment, improving the system's degree of automation. In experiments with queries suited to GPU acceleration, the major-auxiliary cooperative pre-eviction and placement strategies enable UMates to improve average performance by 105.4% and 66.2% over the original system and the prior self-adaptive strategy, respectively.

In summary, this dissertation concentrates on runtime UM management between the analytical processing engine and the GPU driver, from the perspectives of data transfer, data placement in heterogeneous memory, data pre-eviction, and single-machine multi-GPU cooperation. The designed and implemented system modules and algorithms optimize the performance of GPU-based large-scale analytical query processing.
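To make the placement and pre-eviction mechanisms in contributions (2) and (3) concrete, the following sketch shows the CUDA unified-memory primitives that such policies can be built on: a placement hint plus a prefetch to pin a UM slab on a chosen GPU, and the reverse to pre-evict it to host memory. This is a minimal sketch under stated assumptions; the helper names place_slab_on_gpu and evict_slab and the 2 MiB slab size are illustrative, not UBM's or UMates's actual interface.

// Illustrative only: placing a UM "slab" on a GPU or pre-evicting it to host
// memory with standard CUDA hints. UBM/UMates wrap such calls in self-adaptive
// policies; SLAB_BYTES and the helper names are assumptions of this sketch.
#include <cuda_runtime.h>

constexpr size_t SLAB_BYTES = 2ull << 20;   // hypothetical slab granularity

// Prefer device residency for a slab and migrate it to the given GPU.
void place_slab_on_gpu(void* slab, int gpu, cudaStream_t stream) {
    cudaMemAdvise(slab, SLAB_BYTES, cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(slab, SLAB_BYTES, gpu, stream);
}

// Pre-evict a slab to host memory so later queries find free device memory,
// instead of triggering implicit page migrations when device memory runs out.
void evict_slab(void* slab, cudaStream_t stream) {
    cudaMemAdvise(slab, SLAB_BYTES, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemPrefetchAsync(slab, SLAB_BYTES, cudaCpuDeviceId, stream);
}

int main() {
    void* slab = nullptr;
    cudaMallocManaged(&slab, SLAB_BYTES);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    place_slab_on_gpu(slab, /*gpu=*/0, stream);   // before a query touches it
    evict_slab(slab, stream);                     // batch pre-eviction step
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(slab);
    return 0;
}

Pre-evicting with an explicit prefetch to the host is what allows free device memory to be reserved for subsequent queries, avoiding the expensive implicit migrations that the abstract attributes to device memory exhaustion.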
Keywords/Search Tags:Data Analytics, GPU, Unified Memory, Data Management, Heterogeneous System