
Performance bottlenecks on large-scale shared-memory multiprocessors

Posted on: 2006-06-10
Degree: Ph.D.
Type: Dissertation
University: Stanford University
Candidate: Kunz, Robert C
Full Text: PDF
GTID: 1458390008953287
Subject: Engineering
Abstract/Summary:
While multiprocessors have existed for many years, most parallel architectures are difficult to program efficiently. The key challenge is to simplify the programming model so that programmers can write portable, highly efficient parallel programs with minimal effort. For example, cache-coherent shared-memory architectures trade the memory-system complexity of the coherence protocol for a simpler programming model that does not require communication to be programmed explicitly. Using the FLASH machine, a large-scale cc-NUMA multiprocessor, this dissertation explores the interaction between hardware and software design trade-offs and quantifies the performance gains of memory-system enhancements.

Researchers working on multiprocessor memory systems have advocated easing the programming burden by adding memory-system enhancements designed to reduce memory latency and coherence overhead. Echoing the lessons of the RISC movement over 20 years ago, however, simpler memory-system designs are faster than more complicated ones, primarily because the additional contention that more complicated protocols introduce into the memory system overwhelms the minor latency reductions they provide. Architects of large-scale multiprocessors should therefore focus on minimizing memory-controller occupancy rather than latency alone.

Even setting contention aside, the coherence protocol is a smaller bottleneck than other parts of the system, including the operating system's scheduling policies and the application's effective or ineffective use of the cache-coherent memory system. Programmers still need to tune their programs to a specific architecture, and such tuning limits portability. While coherence protocols may reduce remote communication, the mismatch between an application and the architecture is often more significant and prevents major performance improvements.

Large-scale multiprocessors remain difficult to program because the memory system alone cannot eliminate the need for programmers to stay aware of implicit communication. The software libraries, compiler, and operating system must apply complex, machine-specific optimizations to reduce second- and third-order performance bottlenecks. The memory system should therefore provide meaningful visibility and feedback to program monitoring tools and compilers. Without such tools to assist programmers, the programming advantage of a cache-coherent shared-memory multiprocessor over a message-passing multiprocessor is likely to be small at large processor counts.
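To make the notion of implicit communication concrete, the following is a minimal sketch (not taken from the dissertation) of how coherence traffic arises in a cache-coherent shared-memory program even when no data is logically shared: two threads increment private counters that happen to sit on the same cache line, so every write invalidates the other processor's copy. The thread count, iteration count, and 64-byte line size are illustrative assumptions.

/* Minimal false-sharing sketch: implicit communication on a cache-coherent
 * shared-memory machine. Compile with: cc -O2 -pthread false_sharing.c
 * The 64-byte line size, 2 threads, and iteration count are assumptions. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L
#define LINE  64                          /* assumed cache-line size in bytes */

static volatile long shared_line[2];      /* both counters share one cache line */

struct padded { volatile long value; char pad[LINE - sizeof(long)]; };
static struct padded separate_lines[2];   /* one counter per cache line */

static void *bump_shared(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        shared_line[id]++;                /* each write invalidates the other core's copy */
    return NULL;
}

static void *bump_padded(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        separate_lines[id].value++;       /* no coherence traffic between the two cores */
    return NULL;
}

static double run(void *(*fn)(void *)) {
    struct timespec t0, t1;
    pthread_t t[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, fn, (void *)id);
    for (long id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    /* Both loops do identical logical work; only the data layout differs. */
    printf("counters on one cache line:  %.2f s\n", run(bump_shared));
    printf("counters on separate lines:  %.2f s\n", run(bump_padded));
    return 0;
}

On a cc-NUMA machine such as FLASH, the first version generates a stream of invalidations and memory-controller occupancy even though the program contains no explicit communication; this is the kind of behavior the abstract argues monitoring tools and compilers should expose to programmers.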
Keywords/Search Tags: Memory, Multiprocessor, Performance, Large-scale