
Research On SIMD Vectorization Of Loop Nests And Its Optimization Techniques

Posted on: 2015-09-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y S Hou    Full Text: PDF
GTID: 1108330482479093    Subject: Computer Science and Technology
Abstract/Summary:
SIMD (Single Instruction, Multiple Data) extensions are widely used in HPC architectures: they operate on multiple character, integer, or floating-point elements simultaneously through a single SIMD vector register and thereby achieve data-level parallelism. Owing to their low cost, low power dissipation, and high computing efficiency, SIMD extensions have become an indispensable acceleration component of high-performance microprocessors. As their functionality matures, their application domain is spreading from conventional multimedia processing to scientific computing, digital signal processing, cryptanalysis, and other fields. Because SIMD extensions are exposed directly by the hardware, their efficiency can be fully exploited only with substantial programmer effort, which increases the programming workload and makes hand-written SIMD programs error-prone. The research community has therefore turned its attention to vectorizing compilers, expecting them to provide automatic SIMD vectorization. However, the performance of the vectorized code generated by existing vectorizing compilers still falls far short of user expectations, so improving the effectiveness of automatic SIMD vectorization has become a challenging problem in this area.

Automatic SIMD vectorization is a technique that translates serial programs into vectorized code built from SIMD instructions. It was first realized by adapting conventional vector-parallelization techniques, and the research community has accumulated a wide range of applicable transformations over the past two to three decades. Current vectorizing compilers can generate efficient SIMD code for regular loop nests; nonetheless, existing techniques routinely fail when confronted with irregular loop nests. Based on the source-to-source vectorizing compiler system SW-VEC and targeting the domestic Sunway 1600 general-purpose processor, this dissertation studies vectorization and optimization techniques for irregular loop nests. The main contributions of this work are as follows.

1. Parallelism analysis for SIMD extensions still relies on conventional data-dependence analysis. However, because of architectural differences, the parallelism characteristics of SIMD extensions cannot be quantified by these conventional techniques, which makes the analysis error-prone. To address this problem, this work characterizes the parallelism of SIMD extensions by introducing the concept of restricted parallelism. By combining dependence distance, dependence direction, and other dependence attributes, we propose a parallelism analysis method targeted at SIMD extensions, together with a Tarjan algorithm based on the reversed dependence graph. While handling optimizations that must not violate program dependences, the method also supplies the SIMD parallelism information required by dependence-oriented loop pre-analysis, so that the SIMD parallelism identification rate of loop nests is improved. The evaluation results show that the proposed parallelism analysis method and loop pre-analysis technique expose more SIMD vectorization opportunities to the back-end optimization phases.
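To make the dependence-cycle reasoning in contribution 1 concrete, here is a minimal sketch that runs Tarjan's strongly connected components algorithm over a small statement-level dependence graph; statements that do not lie on a dependence cycle are the classical SIMD candidates. This is a generic illustration under assumed data structures (an adjacency matrix and a fixed MAX_STMTS bound); it does not reproduce the dissertation's reversed-dependence-graph variant or its restricted-parallelism criteria.

```c
/* Minimal Tarjan SCC sketch over a statement-level dependence graph.
 * A statement in a singleton SCC with no self-dependence is, in classical
 * vectorization frameworks, a SIMD candidate; statements inside larger
 * SCCs must stay serial or be handled by loop distribution.
 * The adjacency matrix, MAX_STMTS bound and toy edges are assumptions. */
#include <stdio.h>

#define MAX_STMTS 64

static int n_stmts;                       /* number of statements          */
static int adj[MAX_STMTS][MAX_STMTS];     /* adj[u][v] = 1: dependence u->v */
static int index_of[MAX_STMTS], lowlink[MAX_STMTS], on_stack[MAX_STMTS];
static int stack[MAX_STMTS], sp, next_index;

static void strongconnect(int u)
{
    index_of[u] = lowlink[u] = next_index++;
    stack[sp++] = u;
    on_stack[u] = 1;

    for (int v = 0; v < n_stmts; v++) {
        if (!adj[u][v]) continue;
        if (index_of[v] < 0) {            /* tree edge: recurse            */
            strongconnect(v);
            if (lowlink[v] < lowlink[u]) lowlink[u] = lowlink[v];
        } else if (on_stack[v]) {         /* back edge inside current SCC  */
            if (index_of[v] < lowlink[u]) lowlink[u] = index_of[v];
        }
    }

    if (lowlink[u] == index_of[u]) {      /* u is the root of an SCC       */
        printf("SCC:");
        int v;
        do {
            v = stack[--sp];
            on_stack[v] = 0;
            printf(" S%d", v);
        } while (v != u);
        printf("\n");
    }
}

static void find_sccs(void)
{
    next_index = 0;
    sp = 0;
    for (int i = 0; i < n_stmts; i++) index_of[i] = -1;
    for (int i = 0; i < n_stmts; i++)
        if (index_of[i] < 0) strongconnect(i);
}

int main(void)
{
    /* Toy example: S0 -> S1 and S1 -> S0 form a dependence cycle (serial),
     * while S2 lies on no cycle and is a SIMD candidate. */
    n_stmts = 3;
    adj[0][1] = adj[1][0] = adj[1][2] = 1;
    find_sccs();
    return 0;
}
```

On this toy graph the sketch reports {S2} and {S0, S1} as the two components; a vectorizer would emit SIMD code for the cycle-free component and keep the cyclic one serial.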
2. Many loop nests cannot be brought into the polyhedral framework because of their irregular structures, so polyhedral-model-based SIMD optimization opportunities are lost in the subsequent compilation process; this has become an important factor limiting the effect of SIMD optimization. To solve this problem, this dissertation presents a hybrid static-dynamic compilation approach. Based on an analysis of the programs' profiling information, we propose a speculative SIMD optimization technique driven by dynamic decision criteria, together with a speculative SIMD parallel code-generation framework. The framework decides at run time, according to these criteria, whether the serial or the SIMD vector version of a loop is executed, thereby enabling SIMD optimization of loop nests with irregular structures.

3. While exposing more parallelization potential, loop nests also make SIMD vectorization more difficult; in particular, choosing the vectorization scheme with the best performance becomes hard. To address this issue, we introduce a polyhedral modeling approach for a SIMD vectorization framework whose transformations include loop permutation, loop peeling, and statement scheduling, and we implement a vectorization-scheme generation algorithm on this basis. On top of this framework we also design a SIMD vectorization cost model for unaligned array references. Combining these components yields a scheme-selection algorithm that chooses the scheme with the minimum estimated SIMD vectorization cost, which addresses the scheme-selection problem for loop nests. The experimental results show that the scheme selected by the proposed method matches the one used in hand-vectorized programs, so the SIMD optimization of loop nests is improved.

4. SIMD extensions access vector data through a multi-level cache hierarchy, so the locality of vector data is an important factor in SIMD optimization results. To improve the cache hit rate, this dissertation proposes an adaptive, platform-aware loop tiling scheme that takes both the hardware environment and the computation itself into account. By relating the data touched in each loop iteration to the available cache capacity, the method determines the tile sizes from the SIMD register width, the first-level and second-level cache capacities, and the program's demand for data, which greatly improves the cache hit rate (a simple tile-size heuristic along these lines is sketched after this summary). The evaluation results show that the approach achieves better data locality on different hardware platforms and for different applications, further improving the SIMD optimization results.

5. Because of an inherent limitation of the SLP (Superword Level Parallelism) algorithm, the SIMD code it generates usually still contains parallelism that can be exploited further. To take full advantage of these opportunities, this dissertation presents a vector regrouping optimization based on maximum matching degree. By introducing the notion of vector matching degree, the technique searches for and matches against the existing vector that has the smallest difference from the required regrouped vector, so that the total number of vector regrouping instructions is reduced and the execution efficiency of the SIMD code is improved. The experimental results show that the technique effectively reduces the number of regrouping instructions and achieves the expected optimization effect on the tested programs.
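As a purely illustrative reading of the matching-degree idea in contribution 5, the sketch below scores candidate vectors by how many lanes already hold the values a new SLP lane arrangement needs, and picks the candidate that leaves the fewest lanes to be patched by regrouping instructions. The lane-ID encoding and this particular definition of matching degree are assumptions, not the dissertation's exact formulation.

```c
/* Illustrative "matching degree" selection: prefer the already-materialized
 * vector that shares the most lanes with the required arrangement, so fewer
 * lanes must be filled by extra regrouping (shuffle/insert) instructions. */
#include <stdio.h>

#define LANES 4

/* Count lanes where 'have' already holds the value the target needs. */
static int matching_degree(const int have[LANES], const int want[LANES])
{
    int m = 0;
    for (int i = 0; i < LANES; i++)
        if (have[i] == want[i]) m++;
    return m;
}

int main(void)
{
    /* Each lane stores the ID of the scalar value it carries. */
    int want[LANES]  = { 10, 11, 12, 13 };   /* arrangement SLP now needs */
    int cand0[LANES] = { 10, 11, 20, 21 };   /* matches 2 lanes           */
    int cand1[LANES] = { 10, 11, 12, 30 };   /* matches 3 lanes           */
    const int *cands[] = { cand0, cand1 };

    int best = 0, best_m = -1;
    for (int c = 0; c < 2; c++) {
        int m = matching_degree(cands[c], want);
        if (m > best_m) { best_m = m; best = c; }
    }
    /* cand1 wins: only one lane must be patched by a regroup instruction
     * instead of two. */
    printf("best candidate: %d (matching degree %d, %d lane(s) to regroup)\n",
           best, best_m, LANES - best_m);
    return 0;
}
```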
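For the tile-size reasoning mentioned in contribution 4, the following minimal sketch shows one way a compiler might fit a tile's working set into the first-level data cache while keeping the tile a whole number of SIMD vectors wide. The cache capacity, the bytes-per-iteration model, and the 0.75 occupancy factor are illustrative assumptions rather than the dissertation's actual model; an analogous computation against the second-level cache capacity would size the next tiling level.

```c
/* Illustrative tile-size heuristic: fit one tile's working set into a
 * fraction of the L1 data cache and round the tile width down to a
 * multiple of the SIMD vector length. */
#include <stdio.h>
#include <stddef.h>

static size_t pick_tile(size_t l1_bytes,       /* L1 data-cache capacity     */
                        size_t bytes_per_iter, /* data touched per iteration */
                        size_t vec_lanes)      /* SIMD lanes per vector      */
{
    size_t budget = (l1_bytes * 3) / 4;        /* leave room for other data  */
    size_t tile = budget / bytes_per_iter;     /* iterations fitting in L1   */
    tile -= tile % vec_lanes;                  /* whole number of full-width
                                                  vector steps per tile      */
    return tile > 0 ? tile : vec_lanes;        /* never below one vector     */
}

int main(void)
{
    /* Example: 32 KiB L1, a kernel touching three double arrays
     * (24 bytes per iteration), 256-bit registers = 4 double lanes. */
    size_t tile = pick_tile(32 * 1024, 24, 4);
    printf("inner tile size: %zu iterations\n", tile);   /* prints 1024 */
    return 0;
}
```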
Keywords/Search Tags: Parallel Optimization, Automatic SIMD Vectorization, Dependence Analysis, Cost Estimation, Speculative Parallelization, Polyhedral Model, Loop Optimization