Performance of parallel algorithms on a broadcast-based architecture

Posted on:2004-01-13

Degree:Ph.D

Type:Thesis

University:Drexel University

Candidate:Narravula, Harsha V

GTID:2468390011970619

Subject:Engineering

Abstract/Summary:

Research in high-end computing has produced enormous benefits to society. While new data- and computation-intensive applications are appearing all the time, there is evidence that present scalable parallel architectures may not be well suited for these applications. To achieve petaflops computing, advances in hardware technology, architecture, system software, and programming environments are needed. Due to advances in fiber optics and VLSI technology, interconnection networks that allow multiple simultaneous broadcasts are becoming feasible. The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency, high-bandwidth fiber-optic network with the unique feature that every processor is directly connected to every other processor through a dedicated broadcast/output channel. This thesis presents the multiprocessor architecture of the SOME-Bus and examines the performance of representative algorithms for matrix operations and sorting using the message-passing and distributed-shared-memory (DSM) paradigms. It shows that simple enhancements to the network interface and to the cache and directory controllers can greatly improve performance; for example, the communication time of a matrix-vector multiplication algorithm is reduced to O(1) using DSM. Existing parallel loop scheduling schemes are extended to make them suitable for the high-end system under study, and the efficient mapping of existing parallel software to the system is studied. Software is implemented, tested, and evaluated for performance on a simulator developed for the system. The thesis also presents enhancements to the network interface and to the cache and directory controllers that allow significant overlap of processing time with the communication time due to compulsory misses.
Results from the simulated execution of simple algorithms such as matrix-matrix multiplication on the SOME-Bus show that block capture and prefetch, combined with an effective block replacement policy, significantly reduce the miss rate due to compulsory misses as the cache size increases, whereas a similar increase in cache size in traditional architectures leaves the miss rate due to compulsory misses unaffected.
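
The cache-size effect described above can be illustrated with a toy model. The sketch below is not the simulator used in the thesis; it is a minimal Python illustration under stated assumptions: each cache uses LRU replacement, "block capture" is modeled as every cache storing any block that another processor fetches, and the access pattern (each processor reading the same shared blocks in a staggered order) is hypothetical. The names `compulsory_misses` and `staggered_traces` are invented for this example.

```python
from collections import OrderedDict


def compulsory_misses(traces, cache_blocks, capture):
    """Count first-reference misses summed over all processors.

    traces: one list of block ids per processor; processors take turns,
            one access each per step (an assumed interleaving).
    cache_blocks: capacity of each private cache in blocks (LRU, assumed).
    capture: if True, every other cache also stores each fetched block,
             a toy stand-in for block capture on a broadcast network.
    """
    n = len(traces)
    caches = [OrderedDict() for _ in range(n)]
    seen = [set() for _ in range(n)]  # blocks each processor has referenced
    misses = 0

    def install(cache, block):
        # Insert or refresh the block, then evict the least recently used.
        cache[block] = True
        cache.move_to_end(block)
        if len(cache) > cache_blocks:
            cache.popitem(last=False)

    for step in range(max(len(t) for t in traces)):
        for p in range(n):
            if step >= len(traces[p]):
                continue
            block = traces[p][step]
            first_reference = block not in seen[p]
            seen[p].add(block)
            if block in caches[p]:
                caches[p].move_to_end(block)  # hit
                continue
            if first_reference:
                misses += 1  # first reference that also missed
            install(caches[p], block)
            if capture:
                for q in range(n):
                    if q != p:
                        install(caches[q], block)
    return misses


def staggered_traces(processors, blocks):
    """Hypothetical workload: all processors read the same shared blocks,
    each starting at a different offset."""
    stride = blocks // processors
    return [[(p * stride + i) % blocks for i in range(blocks)]
            for p in range(processors)]


if __name__ == "__main__":
    traces = staggered_traces(processors=4, blocks=16)
    for size in (4, 16):
        without = compulsory_misses(traces, size, capture=False)
        with_capture = compulsory_misses(traces, size, capture=True)
        print(size, without, with_capture)
```

In this toy workload the count without capture is 64 for both cache sizes, since every processor must fetch each of the 16 blocks itself. With capture and a 16-block cache the count drops to 16, because each block is fetched once and then held by every cache. The model ignores prefetch, timing, and coherence, so it shows only the direction of the effect, not its magnitude.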

Keywords/Search Tags:

Compulsory misses, Performance, Parallel, Algorithms, Cache

Related items

1	Compiler optimizations for avoiding cache conflict misses
2	Predictive Algorithm For L2 Cache Misses On Chip Multi-Processors
3	Computation of cache misses in matrix multiplication
4	Research On Analytical Modeling Of Memory Subsystem Performance
5	Research Of High Performance Algorithms Utilizing Cache Fully
6	Software Performance Defect Discovery Based On Dynamic Symbol Execution Technology
7	The Army Compulsory Military Service Performance Evaluation Studies
8	The effects of cache coherence on the performance of parallel PDE algorithms in multiprocessor systems
9	Cache performance analysis of algorithms
10	Study On Users’ Consumption Effect Evaluation Of Compulsory Information