Automatic Generation Of Multi-task Streamed Code For Heterogeneous Systems

Posted on: 2022-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: S S Min
Full Text: PDF
GTID: 2568307169983339
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneous computing systems are more energy-efficient than homogeneous systems and are widely used in compute-intensive applications. A heterogeneous system generally consists of a CPU and one or more accelerator devices. As the host, the CPU is mainly responsible for managing the more complex operations of the accelerators, while the accelerators are mainly responsible for the computing tasks. Using such systems efficiently poses many challenges, including how to minimize the data movement overhead between the CPU and the devices. A multi-task streaming mechanism places tasks in different task queues so that multiple tasks can run concurrently, which effectively hides this overhead. The basic idea is to overlap kernel computation with data transfer: while one task is executing its kernel, another task can transfer data. However, writing heterogeneous multi-task streaming programs by hand is much more complicated than writing traditional heterogeneous programs and demands more effort from programmers. This thesis studies techniques for automatically generating heterogeneous multi-task streaming code and for automatically optimizing the performance of the generated code, with the aim of reducing the burden of writing such code manually. The main contributions of the thesis are as follows:

(1) Automated generation of multi-task streaming code in OpenCL. Based on a dependency analysis of the loops in the input code, we analyze the data streaming in loops that can be executed in parallel, partition the data that can be streamed, and divide each task into several sub-tasks, organized as multi-task streams, for data movement and kernel execution. The data streaming analysis module extends the analysis of the outermost loop with an analysis of the inner loops. When the outermost-loop analysis finds that the data cannot be transferred in blocks, our approach decides whether to stream the inner data by analyzing the inner data together with the outer-loop information, which widens the scope of the data streaming analysis and brings additional performance improvement. The tool can generate OpenCL multi-task streaming code with multiple kernels; the number of kernels depends on the number of parallel loops in the source code. Our experimental results show that multi-task streamed code achieves better performance than non-streamed code, with a speedup of 1.1x (up to 1.4x) on the CUDA platform and 1.4x (up to 2.8x) on the Intel platform. The OpenCL multi-task streamed code generated by this method therefore effectively overlaps data movement and kernel execution, improving the overall performance of the program.

(2) Automatic analysis and optimization of multi-task streamed code. The thesis extends the overall code generation framework with a code optimization module that optimizes the host-side code and the device-side code separately. On the host side, we implement automatic redundancy optimization. We perform data redundancy analysis on the generated multi-task streamed OpenCL code with multiple kernels to avoid unnecessary data movement. By analyzing the parallel loops and the code between them, the method keeps data required by a subsequent loop on the device until all kernels have used it or the host accesses it. On the device side, we implement index transformation, memory optimization, and polyhedral optimization. First, we change the loop dimension of the kernel code and interchange the OpenCL index space for multidimensional data, which increases the number of threads executing the program on the device and improves overall performance. Second, we optimize memory access: frequently accessed data is copied from global memory to the private memory of the work-item, which reduces global memory accesses and improves data locality. Third, we use existing polyhedral extraction tools to optimize the data layout of the kernel code and use the local memory of the work-group to further improve data access efficiency. The experimental results show that, compared with the code generated by the existing automatic compilation tool ppcg, the optimized code generated by our tool achieves a 1.1x (up to 1.2x) performance improvement and a 1.2x (up to 1.7x) performance improvement on the Intel platform.
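
The overlap of data movement and kernel execution described in the abstract can be illustrated with a small timing model. The sketch below is not the thesis tool and does not use OpenCL. It is a simplified Python model that assumes three independent engines (copy-in, compute, copy-out), equal-sized chunks, and no per-chunk launch overhead. The function names and the example durations are hypothetical and are not measurements from the thesis.

```python
def serial_time(h2d, kernel, d2h):
    """Non-streamed execution: copy all data in, run the kernel, copy results out."""
    return h2d + kernel + d2h


def streamed_time(h2d, kernel, d2h, n_chunks):
    """Streamed execution: split the work into n_chunks equal tasks and pipeline them.

    Assumption (illustrative only): three independent engines, one per stage,
    each processing one chunk at a time, with no extra per-chunk overhead.
    """
    stage = [h2d / n_chunks, kernel / n_chunks, d2h / n_chunks]
    free_at = [0.0, 0.0, 0.0]  # time at which each engine becomes idle
    for _ in range(n_chunks):
        ready = 0.0  # time at which this chunk finished its previous stage
        for s in range(3):
            start = max(ready, free_at[s])  # wait for both the data and the engine
            ready = start + stage[s]
            free_at[s] = ready
    return free_at[2]  # the last copy-out marks the end of the whole program


if __name__ == "__main__":
    # Hypothetical durations in arbitrary time units.
    print(serial_time(4, 8, 4))       # 16
    print(streamed_time(4, 8, 4, 4))  # 10.0
```

With these made-up durations, the kernel of one chunk runs while the next chunk is being copied in, so only the first copy-in and the last copy-out remain exposed. Real speedups depend on the hardware, the number of queues, and per-task overheads that this model ignores.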
Keywords/Search Tags: Multi-task streams, OpenCL, Code generation, Code optimization