
Research On Scheduling Methods Of Massive Parallel Processors For Resource And Performance Optimization

Posted on: 2016-01-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y. L. Yu
GTID: 1318330482467631
Subject: Computer application technology

Abstract/Summary:
Massive parallel processors, with their huge number of computing units, bring performance speedups to many applications through high parallelism. A general-purpose graphics processing unit (GPGPU) built on GPU hardware is a paradigmatic example of such a processor. With programming environments such as CUDA and OpenCL, GPGPU has become a hot topic in high performance computing. More and more operating systems, image processing and rendering software, and scientific computing software offload their computations to GPGPUs for better performance. Many achievements on GPGPU applications are published each year, expanding the fields where GPGPUs apply and surpassing the previous best performance of many algorithms.

However, because its architecture differs from the CPU's, a GPGPU, like any massive parallel processor, poses a great challenge in exploiting its performance potential. In the software layer, it requires taking advantage of the various GPGPU computing resources according to their diverse features to achieve optimal performance. In the hardware layer, it requires upgrading the architecture design and optimizing the scheduling mechanisms and schemes to achieve high resource utilization at a low hardware overhead. The GPGPU scheduling system is thus critical to resource usability and performance. In this dissertation, GPGPU scheduling optimization strategies are categorized into three aspects: resource assignment, execution order, and parallelism granularity. Along these three aspects, several scheduling optimizations at various levels of the GPGPU scheduling system are proposed, based on observations and studies of the existing scheduling mechanisms and schemes. Their details are listed below.

(I) The resource assignment mechanism allocates and manages GPGPU computing resources to avoid faulty allocation and access collisions. The focus here is on GPGPU memory resources. (1) The multi-address-space memory model employed by GPGPUs complicates data management. An address-space encapsulation is proposed, which combines the multiple data copies into one data structure; it simplifies resource utilization without a performance loss (a minimal sketch follows this paragraph). (2) The static relationship between specific memory resources and GPGPU kernels leads to access collisions in multi-threaded GPGPU programs. Using texture memory as a case study, a memory resource pool is proposed that changes the static relationship into dynamic allocation and management, improving the utilization of such memory (a second sketch follows).
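To make the address-space encapsulation of (I)(1) concrete, here is a minimal sketch in CUDA C++, assuming the standard CUDA runtime API; the structure name, the float element type, and the helper functions are illustrative, not the dissertation's actual design. User code holds one handle that owns both copies of the data instead of tracking a host pointer and a device pointer separately.

    #include <cuda_runtime.h>
    #include <cstdlib>

    // One structure owns both copies of a buffer, hiding the
    // multi-address-space memory model behind a single handle.
    struct EncapsulatedBuffer {
        float  *host;    // copy in the host address space
        float  *device;  // copy in the GPU global-memory address space
        size_t  count;
    };

    void buf_create(EncapsulatedBuffer *b, size_t count) {
        b->count = count;
        b->host  = (float *)malloc(count * sizeof(float));
        cudaMalloc(&b->device, count * sizeof(float));
    }

    void buf_to_device(EncapsulatedBuffer *b) {  // sync host -> device
        cudaMemcpy(b->device, b->host, b->count * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    void buf_to_host(EncapsulatedBuffer *b) {    // sync device -> host
        cudaMemcpy(b->host, b->device, b->count * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    void buf_destroy(EncapsulatedBuffer *b) {
        free(b->host);
        cudaFree(b->device);
    }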
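For the texture-memory resource pool of (I)(2), the sketch below assumes CUDA texture objects and a mutex-guarded free list; the class name, slot layout, and acquire/release interface are assumptions made for illustration. A host thread acquires a texture handle before a kernel launch and releases it afterwards, replacing the static kernel-to-texture binding with dynamic allocation.

    #include <cuda_runtime.h>
    #include <mutex>
    #include <vector>

    struct TexSlot {
        cudaArray_t         array;  // backing storage for the texture
        cudaTextureObject_t tex;    // handle passed to kernels
        bool                in_use;
    };

    // A mutex-guarded pool of 1D float textures: host threads acquire a
    // texture object before a launch and release it afterwards, so no
    // two threads collide on the same texture resource.
    class TexturePool {
        std::vector<TexSlot> slots_;
        std::mutex           lock_;
    public:
        TexturePool(int n, int width) {
            cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
            for (int i = 0; i < n; ++i) {
                TexSlot s{};
                cudaMallocArray(&s.array, &desc, width);
                cudaResourceDesc res{};
                res.resType = cudaResourceTypeArray;
                res.res.array.array = s.array;
                cudaTextureDesc td{};
                td.readMode = cudaReadModeElementType;
                cudaCreateTextureObject(&s.tex, &res, &td, nullptr);
                slots_.push_back(s);
            }
        }
        // Acquire a free slot; returns its index, or -1 if all are busy.
        int acquire(cudaTextureObject_t *out) {
            std::lock_guard<std::mutex> g(lock_);
            for (size_t i = 0; i < slots_.size(); ++i)
                if (!slots_[i].in_use) {
                    slots_[i].in_use = true;
                    *out = slots_[i].tex;
                    return (int)i;
                }
            return -1;
        }
        void release(int i) {
            std::lock_guard<std::mutex> g(lock_);
            slots_[i].in_use = false;
        }
    };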
(II) The schemes on execution order optimize the assignment of computing tasks, such as threads, cooperative thread arrays (CTAs), and kernels, in time and space according to their dependencies and resource requirements. The focus here is on threads and CTAs. (1) The design of the single-instruction-multiple-thread (SIMT) co-scheduling execution model employed in GPGPUs is analyzed through VCPU co-scheduling in a virtualized environment. To address the time-slot fragments caused by VCPU co-scheduling, fine-grained co-scheduling optimizations are proposed that reduce the number of co-scheduled VCPUs. The study validates the soundness of co-scheduling in GPGPUs and the performance benefits of a fine-grained co-scheduling optimization. (2) Load balance is an important factor in GPGPU performance. Because of the data locality optimization in the memory controller, existing CTA scheduling schemes suffer from load imbalance in CTA issuance. A patch module is designed for the CTA scheduler, which uses credits to limit CTA over-issuance, optimizing load balance and improving performance (a simulator-style sketch follows this abstract); the design is compatible with existing scheduling schemes.

(III) The optimization on parallelism granularity matches the parallelism of the hardware to that of the computation, across codes, kernels, and threads, to reduce scheduling overhead and resource congestion while keeping resource utilization high. The focus here is on parallel repacking in source code and on runtime thread-level parallelism (TLP). (1) Existing parallel-repacking code transformation approaches cover an inadequate set of statement patterns. Based on a recursive source code model, a novel automatic parallel-repacking code transformation method is proposed, which also covers synchronization statements inside multi-layered divergence or loops (an illustrative before/after example follows this abstract). (2) Existing TLP optimizations in the CTA scheduler are coarse-grained and not very accurate. A stall-aware warp scheduling scheme (SAWS) is proposed to dynamically optimize TLP at a fine granularity for high pipeline efficiency (sketched at the end of this abstract). A study of TLP optimization combining both the CTA and warp schedulers shows that fine-grained TLP optimization in the warp scheduler is the better choice.
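A simulator-style sketch of the credit-based patch module of (II)(2); the credit counts, the fallback policy, and the interface are assumptions made for illustration. The patch wraps whichever streaming multiprocessor (SM) the existing scheme prefers and redirects the CTA only when that SM's credits are exhausted, which caps over-issuance and keeps the design compatible with the underlying scheme.

    #include <vector>

    struct SM { int credits; };

    // Wraps an existing CTA scheduling scheme: each SM holds a few
    // credits, issuing a CTA consumes one, and the credit returns when
    // the CTA finishes. Once an SM runs out of credits the scheduler
    // must pick another SM, which caps over-issuance.
    class CreditedCtaScheduler {
        std::vector<SM> sms_;
    public:
        CreditedCtaScheduler(int num_sms, int credits_per_sm)
            : sms_(num_sms, SM{credits_per_sm}) {}

        // preferred: the SM chosen by the existing (e.g. locality-aware)
        // scheme; it is overridden only when its credits are exhausted.
        int issue_cta(int preferred) {
            if (sms_[preferred].credits > 0) {
                --sms_[preferred].credits;
                return preferred;
            }
            int best = -1;  // fall back to the SM with the most credits
            for (int i = 0; i < (int)sms_.size(); ++i)
                if (sms_[i].credits > 0 &&
                    (best < 0 || sms_[i].credits > sms_[best].credits))
                    best = i;
            if (best >= 0) --sms_[best].credits;
            return best;    // -1: all SMs saturated, stall CTA issuance
        }

        void cta_finished(int sm) { ++sms_[sm].credits; }
    };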
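An illustrative before/after pair for the parallel repacking of (III)(1), showing only the single-layer case; the dissertation's recursive method further handles synchronization inside multi-layered divergence and loops. The kernels and the data-access pattern are invented for this example.

    #include <cuda_runtime.h>

    // Before: __syncthreads() sits inside a divergent branch, so not
    // all threads of the block reach the barrier -- invalid as written.
    __global__ void before(float *out, const float *in) {
        __shared__ float buf[128];
        int t = threadIdx.x;          // blockDim.x assumed to be 256
        if (t < 128) {                // divergent branch
            buf[t] = in[blockIdx.x * 128 + t] * 2.0f;
            __syncthreads();          // not reached by threads 128..255
            out[blockIdx.x * 128 + t] = buf[t] + buf[(t + 1) % 128];
        }
    }

    // After repacking: the branch is split around the barrier and each
    // half is predicated, so every thread reaches __syncthreads().
    __global__ void after(float *out, const float *in) {
        __shared__ float buf[128];
        int t = threadIdx.x;
        bool active = (t < 128);      // branch condition as a predicate
        if (active) buf[t] = in[blockIdx.x * 128 + t] * 2.0f;
        __syncthreads();              // hoisted to a uniform point
        if (active) out[blockIdx.x * 128 + t] = buf[t] + buf[(t + 1) % 128];
    }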
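Finally, a simulator-style sketch in the spirit of the stall-aware warp scheduling of (III)(2); the widen-only policy and the stall threshold are assumptions rather than the dissertation's exact SAWS algorithm. It shows how observed pipeline stalls can drive a fine-grained adjustment of the number of warps considered for issue.

    #include <vector>

    struct Warp { bool ready; };

    // Issues from a limited active subset of warps and widens the
    // subset after a run of cycles in which no active warp could issue,
    // so TLP rises only when the pipeline needs more warps to hide
    // latency.
    class StallAwareWarpScheduler {
        std::vector<Warp> warps_;
        int active_limit_;     // current TLP: warps considered for issue
        int stall_streak_ = 0; // consecutive cycles with no issue
    public:
        StallAwareWarpScheduler(int num_warps, int initial_limit)
            : warps_(num_warps, Warp{true}), active_limit_(initial_limit) {}

        void set_ready(int w, bool r) { warps_[w].ready = r; }

        // Called once per cycle; returns a warp to issue, or -1 on a stall.
        int pick() {
            for (int i = 0; i < active_limit_ && i < (int)warps_.size(); ++i)
                if (warps_[i].ready) { stall_streak_ = 0; return i; }
            ++stall_streak_;
            // Fine-grained TLP adjustment: after a sustained stall, admit
            // one more warp into the active set (threshold is illustrative).
            if (stall_streak_ > 4 && active_limit_ < (int)warps_.size()) {
                ++active_limit_;
                stall_streak_ = 0;
            }
            return -1;
        }
    };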
Keywords: Massive Parallel Processor, GPGPU, Scheduling System, Resource Utilization Optimization, Performance Optimization