Improving parallelism and data locality with affine partitioning

Posted on:2002-02-10

Degree:Ph.D

Type:Dissertation

University:Stanford University

Candidate:Lim, Amy Wingmui

Full Text:PDF

GTID:1468390011996430

Subject:Computer Science

Abstract/Summary:

To effectively harness the power of modern parallel machines, it is important to find parallelism that does not incur high synchronization cost between the processors. Effective utilization of the memory hierarchy is another crucial factor in getting high performance. In the past, a large number of compiler algorithms have been proposed in the areas of finding loop-level parallelism and improving data locality. Many of these algorithms rely on ad hoc techniques to select a series of transformations. Others attempt to place a number of transformations in a single framework to avoid ad hoc selections. Unfortunately, frameworks proposed so far are either limited to perfectly nested loops or they lack algorithms to select the best possible transformations in the frameworks.; This dissertation proposes a new transformation framework for the program domain of arbitrary loop structures with affine array accesses and loop bounds. This framework unifies a large class of transformations, including loop interchange, reversal, skewing, fusion, fission, reindexing, scaling, and statement reordering. In this framework, each statement is given its own affine mapping describing the partition of its dynamic instances to different processors or to different sequential steps. Two algorithms, one for parallelization and one for locality optimization, were developed under this framework and can be easily combined.; The parallelization algorithm derives from data dependence constraints the affine mappings that maximize the degree of parallelism with the least amount of synchronization. The degree of parallelism found by the algorithm is optimal with respect to all the unified transformations. The algorithm also minimizes communication by trading off excess degrees of parallelism and by choosing pipeline parallelism over doall parallelism if it can significantly reduce communication cost. Optimizing memory performance in a single processor, the locality algorithm performs aggressive affine transformations to separate the independent threads in a program and if possible, place statements with data reuse into perfectly nested loops. These transformations also have the benefits of enabling blocking and array contraction to be applied across arbitrarily nested loops.

Keywords/Search Tags:

Parallelism, Affine, Transformations, Nested loops, Data, Locality

Related items

1	Studies On Data Locality And Compiler Optimization Techniques
2	Optimizing locality and parallelism through program reorganization
3	Memory and parallelism optimization for embedded systems
4	Locality transformations for adaptive irregular applications
5	Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance
6	Invariant curve/surface alignment in the presence of affine and some nonlinear transformations with application in image registrations
7	Automatic parallelization of affine loops using dependence and cache analysis in a binary rewriter
8	A Scalable Locality-aware Adaptive Work-stealing Scheduler for Multi-core Task Parallelism
9	The Research And Implementation Of Parallel Computing Method On MPCore Multicore Processor
10	Dirichlet Process Mixture Models for Nested Categorical Data