
Performance of parallel algorithms on a broadcast-based architecture

Posted on: 2004-01-13
Degree: Ph.D.
Type: Thesis
University: Drexel University
Candidate: Narravula, Harsha V
Full Text: PDF
GTID: 2468390011970619
Subject: Engineering
Abstract/Summary:
Research in high-end computing has produced enormous benefits to society. While new data- and computation-intensive applications are appearing all the time, there is evidence that present scalable parallel architectures may not be well suited for these applications. To achieve petaflops computing, advances in hardware technology, architecture, system software, and programming environments are needed.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow multiple simultaneous broadcasts are becoming feasible. The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency, high-bandwidth fiber-optic network with the unique feature that every processor is directly connected to all other processors through a dedicated broadcast/output channel. This thesis presents the multiprocessor architecture of the SOME-Bus and examines the performance of representative algorithms for matrix operations and sorting using the message-passing and distributed-shared-memory (DSM) paradigms. It shows that simple enhancements to the network interface and to the cache and directory controllers can greatly improve performance; for example, the communication time of a matrix-vector multiplication algorithm is reduced to O(1) using DSM.
Existing parallel loop schemes are extended to make them suitable for the high-end system under study, and the efficient mapping of existing parallel software to the system is studied. Software is implemented, tested, and evaluated for performance on a simulator developed for the system. The thesis also presents enhancements to the network interface and to the cache and directory controllers that allow significant overlap of processing time with the communication time due to compulsory misses.
Results from the simulated execution of simple algorithms such as matrix-matrix multiplication on the SOME-Bus show that block capture and prefetch, combined with an effective block replacement policy, significantly reduce the miss rate due to compulsory misses as the cache size increases, whereas a similar increase of cache size in traditional architectures leaves the miss rate due to compulsory misses unaffected.
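The closing claim about traditional architectures can be illustrated with a small sketch. This is not the simulator developed in the thesis, and it does not model block capture or prefetch. It assumes a single fully associative LRU cache that fetches only on demand, row-major arrays, and a naive triple-loop multiplication; the names `matmul_trace`, `count_misses`, and `sweep` are hypothetical. A miss is counted as compulsory when the block has never been referenced before.

```python
from collections import OrderedDict


def matmul_trace(n, words_per_block):
    """Yield the block numbers touched by a naive n x n multiply C = A * B.

    Assumption: A, B and C are stored contiguously in row-major order.
    """
    base_a, base_b, base_c = 0, n * n, 2 * n * n
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield (base_a + i * n + k) // words_per_block
                yield (base_b + k * n + j) // words_per_block
            yield (base_c + i * n + j) // words_per_block


def count_misses(trace, cache_blocks):
    """Return (compulsory, other) miss counts for a demand-fetch LRU cache."""
    cache = OrderedDict()  # block number -> True, ordered oldest to newest
    seen = set()           # every block referenced at least once so far
    compulsory = other = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
            continue
        if block in seen:
            other += 1       # capacity miss: block was evicted earlier
        else:
            compulsory += 1  # first-ever reference to this block
            seen.add(block)
        cache[block] = True
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return compulsory, other


def sweep(n=16, words_per_block=4, sizes=(8, 32, 128, 512)):
    """Map each cache size (in blocks) to its (compulsory, other) miss counts."""
    return {s: count_misses(matmul_trace(n, words_per_block), s) for s in sizes}


if __name__ == "__main__":
    for size, (compulsory, other) in sweep().items():
        print(f"cache={size:4d} blocks  compulsory={compulsory}  other={other}")
```

Under these assumptions the compulsory count stays the same for every cache size, because each block must still be fetched once on first use; only the remaining misses shrink as the cache grows. Reducing the compulsory count requires bringing blocks into the cache before they are first referenced, which is the role the abstract assigns to block capture and prefetch.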
Keywords/Search Tags:Compulsory misses, Performance, Parallel, Algorithms, Due, Cache