Vectorization and Register Reuse in High Performance Computing

Posted on:2015-09-04

Degree:Ph.D

Type:Thesis

University:The Ohio State University

Candidate:Stock, Kevin Alan

Full Text:PDF

GTID:2478390017993642

Subject:Computer Science

Abstract/Summary:

Automatic vectorization is crucial for attaining maximum performance on modern processors in a performance portable fashion. There is significant room for improvement over techniques in existing compilers, as demonstrated in this work, through careful synthesis of vectorized code and application of aggressive performance optimizations, especially in the domains of tensor contractions and stencils. Tensor contractions are a class of computation used extensively in computational chemistry which are essentially higher-dimension matrix multiplication. Vector code synthesis for tensor contractions has been explored, utilizing a novel in-register transpose for efficiently accessing tensors with SIMD vectors along any dimension. Aggressive use of unroll-and-jam and loop permutation optimizations is able to increase register reuse and amortize the cost of expensive vector operations.;Stencils are a critical part of many scientific codes including PDE solvers and image processing. This work presents a framework for reordering operations in regular loop computations, such as stencils, by utilizing the associativity and commutativity of operations. The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and to a more limited extent in optimizing compilers. The reordering of operations enables substantially improved register reuse through reduced register pressure, resulting in up to 4x performance improvement for existing benchmarks. The multidimensional retiming formalism characterizes the space of valid implementations in conjunction with other transformations from the vector code synthesis used for tensor contractions.;Given the transformations used to optimize high performance software, there are many parameters which create a space of possible variants from which the best, or nearly best, candidate must be found. In addition to auto-tuning over the space of synthesized vector codes, two models have been developed for selecting a high performance variant at compile time. The first model is based on estimating the total number of SIMD instructions executed and demonstrated and for a set of benchmark kernels achieved a mean performance of 73% of the best observed performance across multiple architectures and compilers. The second model used features extracted from the assembly code generated by the compiler which were then processed by machine learning algorithms. By considering a linear combination of the most successful models, a mean performance of 89% of the best observed was achieved in the same tests as the first model.;This research has shown that integration of tile code generation techniques with existing automatic tiling and parallelizing algorithms based on the polyhedral model can produce significant performance improvements on modern architectures such as Intel's Many Integrated Core. Additionally, these techniques have been applied in production software to automatically produce high performance code on multiple architectures for the dominant computational kernel in MADNESS.

Keywords/Search Tags:

Performance, Vector, Register reuse, Code, Tensor contractions

Related items

1	Integrated compiler optimizations for tensor contractions
2	Investigation On Code Optimization Techniques Of Register Pair For Vector VLIW DSP
3	Research And Implementation Of Real-time Detection Methods For Code Reuse Attacks
4	Investigation On Optimization Algorithm Of Complementary Use Of Different Registers Of Symbol Register Spilling For Vector DSP Processor
5	Research On The Attack And Defense Techniques Of Code Reuse
6	Research On Code Reuse Attack Protection Technique Based On Virtual Machine Monitor
7	Research On Implicit Constant Defense In Just-in-time Compilers
8	Research On Condition Code And Register Allocation Optimization In Co-design Emulation
9	Prevention Of Code Reuse Attacks Through Separating The Data From The Code
10	A Research On Code-reuse Attacks And Detection Techniques