Font Size: a A A

Improving parallelism and data locality with affine partitioning

Posted on:2002-02-10Degree:Ph.DType:Dissertation
University:Stanford UniversityCandidate:Lim, Amy WingmuiFull Text:PDF
GTID:1468390011996430Subject:Computer Science
Abstract/Summary:
To effectively harness the power of modern parallel machines, it is important to find parallelism that does not incur high synchronization cost between the processors. Effective utilization of the memory hierarchy is another crucial factor in getting high performance. In the past, a large number of compiler algorithms have been proposed in the areas of finding loop-level parallelism and improving data locality. Many of these algorithms rely on ad hoc techniques to select a series of transformations. Others attempt to place a number of transformations in a single framework to avoid ad hoc selections. Unfortunately, frameworks proposed so far are either limited to perfectly nested loops or they lack algorithms to select the best possible transformations in the frameworks.; This dissertation proposes a new transformation framework for the program domain of arbitrary loop structures with affine array accesses and loop bounds. This framework unifies a large class of transformations, including loop interchange, reversal, skewing, fusion, fission, reindexing, scaling, and statement reordering. In this framework, each statement is given its own affine mapping describing the partition of its dynamic instances to different processors or to different sequential steps. Two algorithms, one for parallelization and one for locality optimization, were developed under this framework and can be easily combined.; The parallelization algorithm derives from data dependence constraints the affine mappings that maximize the degree of parallelism with the least amount of synchronization. The degree of parallelism found by the algorithm is optimal with respect to all the unified transformations. The algorithm also minimizes communication by trading off excess degrees of parallelism and by choosing pipeline parallelism over doall parallelism if it can significantly reduce communication cost. Optimizing memory performance in a single processor, the locality algorithm performs aggressive affine transformations to separate the independent threads in a program and if possible, place statements with data reuse into perfectly nested loops. These transformations also have the benefits of enabling blocking and array contraction to be applied across arbitrarily nested loops.
Keywords/Search Tags:Parallelism, Affine, Transformations, Nested loops, Data, Locality
Related items