
Research On Programming Model And Compiler Optimizations For CPU-GPU Heterogeneous Parallel Systems

Posted on: 2013-02-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T Tang    Full Text: PDF
GTID: 1118330362960104    Subject: Computer Science and Technology
Abstract/Summary:
With continuing improvements in VLSI technology, more than one billion transistors can now be integrated onto a single chip. However, due to the feature-size limitations of CMOS technology, chip frequency cannot be pushed significantly beyond 4 GHz. Consequently, multi-core parallelism has become the main technical approach to improving performance and the utilization of on-chip resources. At present, general-purpose CPUs with 4-8 cores are the mainstream of the market, while special-purpose processors such as stream processors and GPUs (Graphics Processing Units) contain tens or hundreds of processing cores. This thesis focuses on the GPU, a popular and promising commercial stream processor architecture.

The GPU was originally designed to accelerate graphics processing and therefore has a simpler architecture than the CPU. Without complex logic such as branch prediction and out-of-order execution, the GPU can devote its transistors more directly to computation, which gives it a much higher peak performance than a contemporary CPU. Along with the development of instruction-level functionality and improvements in programming interfaces, there has been growing interest in using GPUs to accelerate non-graphics applications, a new research direction known as GPGPU (General-Purpose Computation on GPUs). Constructing heterogeneous parallel systems from CPUs and GPUs, in which the CPUs provide a general basic computing environment and the GPUs provide powerful special-purpose computing capacity, has become an important trend in high-performance computing. The GPU has already exhibited a promising prospect ranging from high-performance computing to desktop and embedded computing.
Research on GPUs and CPU-GPU heterogeneous systems has therefore been carried out in multiple aspects, including programming models, performance analysis and optimization, reliability optimization, and low-power optimization. This thesis is oriented towards two of these aspects, programming and compilation: we first study the programming model for CPU-GPU heterogeneous parallel systems, then put emphasis on the analysis and optimization of GPU memory accesses, and finally study the design, implementation and optimization of a compiler for the programming model we propose.

The programming model is the interface between the computer system and the programmer, and is an important measure of a system's usability that directly influences the system's degree of acceptance. In recent years, GPU programming interfaces have kept improving, from low-level graphics APIs to abstract models such as Brook+, CUDA and OpenCL, which has significantly pushed the GPU towards general-purpose computation. However, compared with those of traditional CPUs, GPU programming models are still more complex, and a large number of existing applications cannot be ported efficiently, which poses a big challenge to GPU software development and migration. To address this problem, this thesis starts from the traditional and widely accepted OpenMP programming model. Based on an analysis of the transferability between OpenMP parallel primitives and the stream model, we evaluate the feasibility of programming a CPU-GPU computing node with an OpenMP-like model. We then propose OpenStream, an OpenMP model enhanced by a group of compiler directives with stream-processing features.

Subsequently, we study performance analysis and optimization from the viewpoint of memory accesses. The memory wall is a bottleneck that throttles the GPU's practical performance, and the cache is the critical level of the memory hierarchy for alleviating the memory wall problem.
Early GPUs did not contain a data cache in the traditional sense; their on-chip memories were used primarily to buffer data related to graphics processing. Driven by the needs of general-purpose computing, GPUs have begun to incorporate general data caches in recent years. Analysis of cache behavior and cache optimization are therefore of great importance for improving the GPU's computing efficiency. To this end, we first propose a memory-aware scalability analysis model for the GPU architecture, which analyzes the relationship between performance scalability and the memory hierarchy. The model theoretically evaluates the importance of a processing core's private cache to performance scalability. We then study cache analysis and optimization for GPU programs. Traditional cache analysis and optimization methods cannot be applied to the GPU directly because of its particular execution model. To address this problem, we propose a reuse-analysis method for GPU programs according to the processing core's execution model, and discuss the conditions under which reuse can be captured by the cache, together with several locality optimization methods. Moreover, to model the cache behavior of GPU programs more accurately, we propose a cache miss analysis model based on the stack-distance profile method, thus laying the foundation for evaluating other cache optimization methods in the future.

Finally, we discuss the compiler design for the OpenStream programming model, organize all of the analysis and optimization methods discussed above into a uniform compiler framework, and propose a basic implementation. In this implementation, we put stress on a chip-level stream scheduling method for the CPU-GPU heterogeneous system, which prolongs the life cycle of data at the GPU end and exploits its locality, thereby reducing redundant data communication between the CPU and the GPU.

The innovations of this thesis are as follows:

1. To solve the programming problem for CPU-GPU heterogeneous parallel systems, we summarize the essence of programming models for heterogeneous parallel computing nodes, evaluate the transferability between OpenMP parallel primitives and GPU programming models, and lay the theoretical foundation for GPU programming-model research based on the OpenMP model; we propose a new compiler-directive-based programming model, OpenStream, by extending the OpenMP model with a group of compiler directives with stream-processing features, so as to ease program design and porting for CPU-GPU heterogeneous parallel systems.

2. To guide on-chip memory optimization for GPU programs, we propose a memory-aware scalability analysis model for GPU-like many-core architectures, analyze the relationship between performance scalability and the memory hierarchy, and indicate, from the viewpoint of scalability, architecture optimization principles for future many-core processors and the on-chip memory optimization emphases for GPU programs.

3. To improve the data locality of GPU programs, we propose a quantitative locality analysis model for GPU kernel programs by analyzing the iteration sequence of kernel execution and extending traditional reuse-analysis theory to the GPU's parallel execution model; we propose two locality optimization methods based on the locality solution process, which efficiently reduce the cache miss rate and improve the performance of GPU programs.

4. To precisely model the cache miss behavior of GPU programs, we propose an accurate cache miss analysis model for GPU programs based on classical cache miss equations and cache contention analysis models, decomposing the cache analysis problem into stack-distance profile analysis of a single thread block and cache contention analysis of multiple thread blocks, thus laying the foundation for efficiently evaluating cache optimization methods for GPU programs.

5. To validate the programming model and the optimization methods, we develop a source-to-source compiler framework for the OpenStream programming model and propose a basic implementation of it; we discuss a heuristic communication scheduling method in this implementation, which prolongs the life cycle of data at the GPU end and exploits its producer-consumer locality while maintaining data consistency, thus reducing redundant data communication between the CPU and the GPU.
Keywords/Search Tags: GPU, Heterogeneous parallel system, Programming model, Scalability, Reuse, Locality, Cache analysis, Compiler, Communication scheduling