Font Size: a A A

Vectorization and Register Reuse in High Performance Computing

Posted on:2015-09-04Degree:Ph.DType:Thesis
University:The Ohio State UniversityCandidate:Stock, Kevin AlanFull Text:PDF
GTID:2478390017993642Subject:Computer Science
Abstract/Summary:
Automatic vectorization is crucial for attaining maximum performance on modern processors in a performance portable fashion. There is significant room for improvement over techniques in existing compilers, as demonstrated in this work, through careful synthesis of vectorized code and application of aggressive performance optimizations, especially in the domains of tensor contractions and stencils. Tensor contractions are a class of computation used extensively in computational chemistry which are essentially higher-dimension matrix multiplication. Vector code synthesis for tensor contractions has been explored, utilizing a novel in-register transpose for efficiently accessing tensors with SIMD vectors along any dimension. Aggressive use of unroll-and-jam and loop permutation optimizations is able to increase register reuse and amortize the cost of expensive vector operations.;Stencils are a critical part of many scientific codes including PDE solvers and image processing. This work presents a framework for reordering operations in regular loop computations, such as stencils, by utilizing the associativity and commutativity of operations. The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and to a more limited extent in optimizing compilers. The reordering of operations enables substantially improved register reuse through reduced register pressure, resulting in up to 4x performance improvement for existing benchmarks. The multidimensional retiming formalism characterizes the space of valid implementations in conjunction with other transformations from the vector code synthesis used for tensor contractions.;Given the transformations used to optimize high performance software, there are many parameters which create a space of possible variants from which the best, or nearly best, candidate must be found. In addition to auto-tuning over the space of synthesized vector codes, two models have been developed for selecting a high performance variant at compile time. The first model is based on estimating the total number of SIMD instructions executed and demonstrated and for a set of benchmark kernels achieved a mean performance of 73% of the best observed performance across multiple architectures and compilers. The second model used features extracted from the assembly code generated by the compiler which were then processed by machine learning algorithms. By considering a linear combination of the most successful models, a mean performance of 89% of the best observed was achieved in the same tests as the first model.;This research has shown that integration of tile code generation techniques with existing automatic tiling and parallelizing algorithms based on the polyhedral model can produce significant performance improvements on modern architectures such as Intel's Many Integrated Core. Additionally, these techniques have been applied in production software to automatically produce high performance code on multiple architectures for the dominant computational kernel in MADNESS.
Keywords/Search Tags:Performance, Vector, Register reuse, Code, Tensor contractions
Related items