High Productivity OpenMP For Distributed Shared Memory Architecture

Posted on: 2008-10-09    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Huang
GTID: 1118360242499266    Subject: Computer Science and Technology
Abstract/Summary:
Nowadays, high-end computing has shifted its goal from the pure pursuit of high performance to the realization of high-productivity systems, which means improving performance, programmability, portability and robustness while reducing the costs of developing, running and maintaining systems. High-productivity computer systems must be supported by high-productivity programming environments. Furthermore, the applications targeting future teraflops and petaflops systems are multidisciplinary and multiscale, and their complexity requires domain experts and software scientists from different disciplines to work together on development, management and maintenance. Such collaboration places higher demands on the performance, programmability, portability and fault tolerance of programming environments. With features such as ease of programming, support for incremental design, good maintainability and high portability, OpenMP is expected to become the mainstream parallel programming language in the long run.

Focusing on the development of a high-productivity OpenMP programming environment for large-scale parallel systems, this thesis systematically investigates key OpenMP techniques: implementing OpenMP on large-scale distributed shared memory (DSM) systems, DSM-oriented OpenMP extensions, compiler-guided data prefetching, checkpoint/restart, OpenMP-oriented low-power optimization, and other related techniques. The main contributions of the thesis are as follows.

1. CCRG OpenMP, an OpenMP compiler for large-scale parallel computer systems, has been designed and implemented. We present a shared-data placement strategy coordinated between compile time and link time, which not only overcomes the drawback that shared memory must be allocated explicitly under a distributed OS, but also supports data-locality optimization and checkpointing.
Several source-level optimization techniques are used to improve performance. Experiments show that the performance of CCRG OpenMP on our SCCMP system is on par with that of the Intel compiler 9.1 on an SGI Altix.

2. Two OpenMP directives, BARRIER(thread_id) and ALLREDUCTION, are proposed to reduce the overhead of global operations such as barrier and reduction, which grows rapidly as OpenMP programs scale up, and algorithms for implementing the new directives are given. Experiments on the real scientific application Plasma Physics show a 76% performance improvement with 64 threads.

3. A compiler-directed two-stage data prefetching algorithm is presented to overcome the inaccuracy caused by the mismatch between remote and local access latencies. The algorithm is evaluated with a static performance analysis model. Experiments show that it improves the performance of swim from SPEC OMP2001 by 14% with 32 threads and by 9% with 64 threads.

4. We present coordinated system-level and application-level OpenMP checkpoint/restart mechanisms and a blocked OpenMP checkpoint protocol. Based on these mechanisms, the CCRG OpenMP checkpoint/restart system has been implemented. It fully supports the OpenMP 2.0 API and offers good scalability and applicability.

5. Energy optimization techniques based on the OpenMP programming model are studied. Three energy optimization methods and their implementations are presented for parallel systems with dynamic voltage scaling (DVS) capabilities. Barrier-section-based worst-case execution time (WCET) analysis and DVS methods are proposed for WCET-based energy optimization.
These methods use the barrier section as the unit of analysis and of voltage scaling, which avoids the execution-time and energy penalties that load imbalance causes at barriers. An analysis model is built, and simulation shows that these techniques can effectively reduce the energy consumption of parallel systems.
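Contribution 2 does not say how ALLREDUCTION is implemented. As an illustration only, the sketch below assumes a recursive-doubling (butterfly) scheme, a standard way to give every thread the reduced value in log2(P) pairwise rounds instead of combining P partial results one at a time and then broadcasting the total after a barrier. The function `allreduce_sum` and the constant `MAX_THREADS` are hypothetical names, and the exchange rounds are simulated sequentially in plain C rather than executed by real threads.

```c
#include <assert.h>
#include <string.h>

#define MAX_THREADS 64 /* hypothetical upper bound on the thread count */

/* Simulated recursive-doubling all-reduce over p "threads".
 * On entry vals[i] is thread i's partial sum; on exit every vals[i]
 * holds the global sum. In round k each thread adds the value held by
 * its partner i ^ 2^k, so after log2(p) rounds all threads agree and no
 * separate broadcast step is needed. */
void allreduce_sum(double *vals, int p)
{
    double next[MAX_THREADS];
    assert(p > 0 && p <= MAX_THREADS && (p & (p - 1)) == 0); /* power of two */
    for (int dist = 1; dist < p; dist <<= 1) {
        /* All exchanges of one round happen "at once": read every old
         * value first, then publish the new ones. */
        for (int i = 0; i < p; i++)
            next[i] = vals[i] + vals[i ^ dist];
        memcpy(vals, next, (size_t)p * sizeof vals[0]);
    }
}
```

In a real runtime each round would be a pairwise flag synchronization between partner threads rather than a full barrier, which is where a saving over a global barrier plus reduction could come from.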
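Contribution 3 does not define the two stages of the prefetching algorithm. One plausible reading, assumed here, is that the compiler first classifies each array stream as local or remote and then derives a separate prefetch distance from the matching latency, rather than a single distance from one averaged latency. The latency constants, `prefetch_distance` and `dot_two_distance` are hypothetical; `__builtin_prefetch` is the GCC/Clang builtin, so this sketch needs one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL machine parameters in cycles; real values vary by system. */
enum { LOCAL_LATENCY = 120, REMOTE_LATENCY = 480, CYCLES_PER_ITER = 16 };

/* Iterations ahead a prefetch must be issued to hide 'latency' cycles:
 * ceil(latency / cycles_per_iter). */
size_t prefetch_distance(unsigned latency, unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (latency + cycles_per_iter - 1) / cycles_per_iter;
}

/* Dot product where 'local' is assumed to live in this node's memory and
 * 'remote' on another node. Each stream gets its own prefetch distance. */
double dot_two_distance(const double *local, const double *remote, size_t n)
{
    const size_t d_local = prefetch_distance(LOCAL_LATENCY, CYCLES_PER_ITER);
    const size_t d_remote = prefetch_distance(REMOTE_LATENCY, CYCLES_PER_ITER);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Arguments: address, 0 = prefetch for read, 3 = keep in cache. */
        if (i + d_local < n)
            __builtin_prefetch(&local[i + d_local], 0, 3);
        if (i + d_remote < n)
            __builtin_prefetch(&remote[i + d_remote], 0, 3);
        sum += local[i] * remote[i];
    }
    return sum;
}
```

With the assumed numbers the local stream is prefetched 8 iterations ahead and the remote stream 30 iterations ahead; a single distance from the averaged latency (about 19 iterations) would prefetch remote data too late and local data earlier than needed.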
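Contribution 4 mentions a blocked OpenMP checkpoint protocol without giving details. The sketch below assumes the simplest blocking scheme: every thread stops at a barrier, a single thread writes the shared state to disk, and a second barrier keeps the others from modifying that state until the write has finished. `write_checkpoint`, `read_checkpoint`, `run_with_checkpoints` and the file layout are hypothetical, and only application-level state (one array plus a step counter) is saved; a system-level mechanism would also have to capture thread stacks and runtime state. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Writes the step counter followed by n doubles. Returns 0 on success. */
int write_checkpoint(const char *path, const double *state, size_t n, long step)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&step, sizeof step, 1, f) == 1
          && fwrite(state, sizeof *state, n, f) == n;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Restores the state and the step at which the checkpoint was taken. */
int read_checkpoint(const char *path, double *state, size_t n, long *step)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(step, sizeof *step, 1, f) == 1
          && fread(state, sizeof *state, n, f) == n;
    fclose(f);
    return ok ? 0 : -1;
}

/* Toy update loop over steps [first, last), checkpointing every 'interval'
 * steps. The implicit barrier of 'omp for' stops all threads once their
 * updates are done, 'omp single' lets exactly one thread write, and the
 * implicit barrier at the end of 'single' holds the others until the write
 * completes. Returns 0 if every checkpoint write succeeded. */
int run_with_checkpoints(double *state, size_t n, long first, long last,
                         long interval, const char *path)
{
    int err = 0; /* shared; only written inside 'single' */
    assert(interval > 0);
    #pragma omp parallel
    for (long s = first; s < last; s++) {
        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            state[i] += 1.0; /* stand-in for the real computation */
        if ((s + 1) % interval == 0) {
            #pragma omp single
            if (write_checkpoint(path, state, n, s + 1) != 0)
                err = 1;
        }
    }
    return err;
}
```

To restart, call `read_checkpoint` and pass the recovered step as `first` to `run_with_checkpoints`.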
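For contribution 5, the sketch below works through barrier-section-based DVS under a deliberately simple model, not one of the three methods from the thesis: `work[i]` stands in for thread i's WCET estimate within one barrier section, the dynamic energy for w cycles at relative frequency f is taken as w * f^2 (power proportional to f^3 when voltage scales with frequency), and static power and discrete voltage levels are ignored. Each thread is slowed just enough to reach the barrier together with the critical thread running at full speed, so slack that would otherwise be spent waiting becomes energy savings. `dvs_barrier_section` and `max_work` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Largest per-thread workload: the critical path of the barrier section. */
static double max_work(const double *work, size_t p)
{
    double m = 0.0;
    for (size_t i = 0; i < p; i++)
        if (work[i] > m) m = work[i];
    return m;
}

/* Assigns each thread the lowest relative frequency that still lets it
 * arrive at the barrier no later than the critical thread at full speed,
 * freq[i] = work[i] / max_work, and returns the section's energy under the
 * assumed w * f^2 model. At full speed (all f = 1) the energy would simply
 * be the sum of work[i]. */
double dvs_barrier_section(const double *work, double *freq, size_t p)
{
    double wmax = max_work(work, p);
    double energy = 0.0;
    assert(p > 0 && wmax > 0.0);
    for (size_t i = 0; i < p; i++) {
        freq[i] = work[i] / wmax;
        energy += work[i] * freq[i] * freq[i];
    }
    return energy;
}
```

For workloads {100, 80, 60, 100} the frequencies become {1.0, 0.8, 0.6, 1.0} and the modeled energy drops from 340 (all threads at full speed) to 272.8, roughly a 20% reduction, without lengthening the section.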
Keywords/Search Tags: High-Productivity, OpenMP, OpenMP Extension, Two-Stage Data Prefetch, Checkpoint/Restart, Low-Energy Optimization