
Exploring fine-grained process interaction in multiprocessor systems

Posted on: 1998-05-15
Degree: Ph.D.
Type: Thesis
University: University of Minnesota
Candidate: Johnson, Donald Elmer
Full Text: PDF
GTID: 2468390014477087
Subject: Computer Science
Abstract/Summary:
Several techniques have been used to improve the performance of process interaction in fine-grained multiprocessor systems. These existing techniques tend to have long memory latencies or synchronization times, or they require complex and expensive hardware. This thesis proposes that user-level hardware and special-purpose communications channels for different interaction domains can dramatically improve access performance at relatively modest hardware cost. The thesis characterizes some specific domains for which the hypothesis holds. New lock and barrier mechanisms are presented that reduce both contention and latency to the minimum values obtainable with shared-bus communications, requiring at most two shared-bus transactions, with one transaction being typical. Distributed hardware locking queues and barrier flags reduce the latency for process continuation after obtaining a lock or reaching a barrier to near zero. Four additional interaction mechanisms are presented that use serial communication between processing elements (PEs) in a manner that eliminates inter-PE clocking delays. All of these new techniques increase scalability, apply to both new architectures and existing systems, and are less complex than other hardware solutions. The optimum two-dimensional cluster size for N PEs is shown to be proportional to $(NI/D)^{1/2}$, where I and D are the mean inter-node times, including gate and time-of-flight, on the global and local loops, respectively. The access latency when optimally clustered is shown to be proportional to $(NID)^{1/2}$. Using conservative parameters and optimal clustering, the maximum numbers of PEs for an expected latency of one microsecond are 15621 for barriers, 61308 for locks, 37698 for shared data, and 14592 for shared registers. All mechanisms are shown to have near-optimum performance if the configuration is near-optimum for any particular mechanism. Hierarchies beyond two levels are shown to have expected latencies proportional to the sum of all loop times.
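The two proportionalities quoted in the abstract can be illustrated with a small sketch. The cost model here is an assumption made for illustration and is not taken from the thesis: latency is treated as one traversal of a global loop with N/c cluster nodes (each costing I) plus one traversal of a local loop with c PEs (each costing D). Under that assumed model, the minimizing cluster size is $(NI/D)^{1/2}$ and the resulting latency is $2(NID)^{1/2}$, which matches the stated scaling. The function names and the parameter values are hypothetical.

```python
import math


def loop_latency(n, c, i, d):
    """Assumed additive model (not from the thesis): global loop over
    n/c cluster nodes at cost i each, plus local loop over c PEs at cost d each."""
    return (n / c) * i + c * d


def optimal_cluster_size(n, i, d):
    """Cluster size that minimizes loop_latency under the assumed model."""
    return math.sqrt(n * i / d)


# Hypothetical values: 10,000 PEs, inter-node times in arbitrary time units.
n, i, d = 10_000, 4.0, 1.0
c_opt = optimal_cluster_size(n, i, d)
best = loop_latency(n, c_opt, i, d)

# Under this model the minimum latency equals 2 * sqrt(N * I * D).
closed_form = 2 * math.sqrt(n * i * d)
```

With these illustrative values, any other integer cluster size between 1 and N gives a latency at least as large as `best`, so the sketch is consistent with the square-root scaling reported above. The constants of proportionality in the thesis itself may differ.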
Keywords/Search Tags: Interaction, Process, Shown