
Unified Memory Management And System Performance Optimization For GPU-Based Large-Scale Analytical Query Processing

Posted on: 2024-04-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J L Wang
GTID: 1528307145495804    Subject: Software engineering
Abstract/Summary:
As data volumes grow, large-scale data analytics is becoming increasingly important to society's development. In recent years, the development of co-processors represented by GPUs (Graphics Processing Units) has created new possibilities for improving the data processing capability of a single machine. However, limited device memory and PCIe bandwidth are the main bottlenecks for GPU-based large-scale analytical query processing. NVIDIA provides UM (Unified Memory) on its parallel computing platform CUDA (Compute Unified Device Architecture); UM uses host memory to expand the GPU-accessible memory space and supports implicit CPU-GPU data migration. Although UM simplifies heterogeneous memory management, it performs poorly when dealing with large datasets. Most existing research efforts concentrate exclusively on either the upper layer, such as query plans and operators, or the lower layer, such as GPU driver optimizations; they do not address runtime UM-based data management in a way that considers the characteristics of both layers. Based on a self-adaptive, loosely coupled system design, this dissertation provides efficient UM management for large-scale data, taking both analytical query processing and UM behavior into account to optimize system performance. It makes the following three contributions.

(1) Boosting host-UM data transfer. Replacing physical device memory with UM expands the GPU-accessible memory space, but to remain compatible with the other modules of the original system, explicit data migration between pageable host memory and UM is still required. The host-UM transfer speed is affected by data block sizes, parallelism, and CPU-GPU heterogeneous resource competition. Based on an analysis of these factors, a data transfer acceleration module called D-Cubicle is designed and implemented. D-Cubicle groups data blocks by their original sizes, optimizes the transfer strategy for each group, and collects feedback on transfer performance in real time. It adjusts the number of transfer threads with a hybrid exponential and linear scheme, so it quickly adapts to changes in system resource competition and sustains a high effective transfer speed. In addition, a two-stage transfer strategy is proposed that uses the host-side UM space as the main buffer and reduces redundant PCIe traffic through on-demand migration. With D-Cubicle, and with data transfer time included, the optimized system achieves an average speedup of 1.43x and a maximum speedup of 2.09x.
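For concreteness, the sketch below illustrates the two-stage idea behind D-Cubicle using only standard CUDA unified memory APIs: pageable host data is first copied into a UM buffer by several CPU threads, and pages are later migrated to the GPU on demand with a prefetch. It is a minimal sketch under stated assumptions, not the dissertation's implementation; the helper name stage_into_um and the constant NUM_COPY_THREADS are illustrative, whereas D-Cubicle tunes the thread count adaptively from runtime feedback.

// Illustrative sketch only: stage 1 copies pageable host data into UM with
// several CPU threads; stage 2 prefetches only the needed range to the GPU.
// NUM_COPY_THREADS and stage_into_um are hypothetical names for this sketch.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>
#include <thread>
#include <vector>

constexpr int NUM_COPY_THREADS = 4;   // D-Cubicle adjusts this adaptively

// Stage 1: parallel memcpy from pageable host memory into a UM allocation.
void stage_into_um(char* um_buf, const char* pageable_src, size_t bytes) {
    std::vector<std::thread> workers;
    size_t chunk = (bytes + NUM_COPY_THREADS - 1) / NUM_COPY_THREADS;
    for (int t = 0; t < NUM_COPY_THREADS; ++t) {
        size_t off = static_cast<size_t>(t) * chunk;
        if (off >= bytes) break;
        size_t len = std::min(chunk, bytes - off);
        workers.emplace_back([=] { std::memcpy(um_buf + off, pageable_src + off, len); });
    }
    for (auto& w : workers) w.join();
}

int main() {
    const size_t bytes = 256ull << 20;        // 256 MiB example payload
    std::vector<char> pageable(bytes, 1);     // pageable host data

    char* um_buf = nullptr;
    cudaMallocManaged(&um_buf, bytes);        // host-side UM as the main buffer

    stage_into_um(um_buf, pageable.data(), bytes);

    // Stage 2: on-demand migration, moving only the part a query touches.
    int device = 0;
    cudaMemPrefetchAsync(um_buf, bytes / 4, device, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(um_buf);
    return 0;
}

Copying with multiple CPU threads reflects the abstract's observation that host-UM transfer speed depends on parallelism and on CPU-GPU resource competition, which is exactly what D-Cubicle's feedback-driven thread adjustment reacts to.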
(2) Optimizing heterogeneous data placement in UM buffers. In large-scale data analytics, besides accelerating the transfer phase, UM access performance during the computation phase must also be optimized. In traditional GPU-accelerated systems, the GPU processes data that resides entirely in physical device memory; with UM, data is dynamically distributed across host memory and device memory, and the bandwidths of these two memory spaces differ by an order of magnitude. Considering device memory usage in analytical queries and UM oversubscription, UBM (Unified memory Buffer Manager) is designed and implemented. UBM uses the UM slab as the basic management unit for heterogeneous memory resources, and a self-adaptive adjustment algorithm for heterogeneous data placement makes full use of physical device memory while reducing the expensive implicit data migration caused by device memory exhaustion. The algorithm uses a table of stepwise decreasing device memory ratios as its adjustment reference; the host-device data placement is changed dynamically for each query by moving UM slabs to the corresponding positions in the table. Moreover, UBM includes a dynamic data pre-eviction strategy that reserves free device memory for subsequent queries based on the actual physical device memory utilization. Across different query sequences, the optimized system achieves an average performance improvement of 32.5% to 157.3%.

(3) Optimizing UM management in a single-machine multi-GPU environment. In a single-machine multi-GPU environment, data management across the GPUs must be coordinated to avoid performance drops caused by one GPU's prolonged processing. Furthermore, the system should exploit the runtime information provided by multiple GPUs to build a more efficient self-adaptive UM management strategy and avoid excessive manual parameters. This work analyzes the performance bottlenecks in single-machine multi-GPU query processing, then designs and implements a single-machine multi-GPU UM management optimizer called UMates, which uses unified batch data pre-eviction to avoid severe blocking. UMates also deliberately differentiates the GPU-side data ratios of the major and auxiliary GPUs to probe the effect of different device memory ratios on actual device memory utilization, and uses this as reference information for self-adaptive data placement adjustment, improving the system's degree of automation. In experiments with queries suited to GPU acceleration, the major-auxiliary cooperative pre-eviction and placement strategies enable UMates to improve average performance by 105.4% and 66.2% over the original system and the prior self-adaptive strategy, respectively.

In summary, this dissertation concentrates on runtime UM management between the analytical processing engine and the GPU driver, from the perspectives of data transfer, data placement in heterogeneous memory, data pre-eviction, and single-machine multi-GPU cooperation. The designed and implemented system modules and algorithms optimize the performance of GPU-based large-scale analytical query processing.
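To make the placement and pre-eviction mechanisms in contributions (2) and (3) concrete, the following sketch shows the CUDA unified-memory primitives that such policies can be built on: a placement hint plus a prefetch to pin a UM slab on a chosen GPU, and the reverse to pre-evict it to host memory. This is a minimal sketch under stated assumptions; the helper names place_slab_on_gpu and evict_slab and the 2 MiB slab size are illustrative, not UBM's or UMates's actual interface.

// Illustrative only: placing a UM "slab" on a GPU or pre-evicting it to host
// memory with standard CUDA hints. UBM/UMates wrap such calls in self-adaptive
// policies; SLAB_BYTES and the helper names are assumptions of this sketch.
#include <cuda_runtime.h>

constexpr size_t SLAB_BYTES = 2ull << 20;   // hypothetical slab granularity

// Prefer device residency for a slab and migrate it to the given GPU.
void place_slab_on_gpu(void* slab, int gpu, cudaStream_t stream) {
    cudaMemAdvise(slab, SLAB_BYTES, cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(slab, SLAB_BYTES, gpu, stream);
}

// Pre-evict a slab to host memory so later queries find free device memory,
// instead of triggering implicit page migrations when device memory runs out.
void evict_slab(void* slab, cudaStream_t stream) {
    cudaMemAdvise(slab, SLAB_BYTES, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemPrefetchAsync(slab, SLAB_BYTES, cudaCpuDeviceId, stream);
}

int main() {
    void* slab = nullptr;
    cudaMallocManaged(&slab, SLAB_BYTES);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    place_slab_on_gpu(slab, /*gpu=*/0, stream);   // before a query touches it
    evict_slab(slab, stream);                     // batch pre-eviction step
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(slab);
    return 0;
}

Pre-evicting with an explicit prefetch to the host is what allows free device memory to be reserved for subsequent queries, avoiding the expensive implicit migrations that the abstract attributes to device memory exhaustion.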
Keywords/Search Tags:Data Analytics, GPU, Unified Memory, Data Management, Heterogeneous System