Optimizations Of Memory-access For Stencil Computations On Shared-memory Multi-core Processor

Posted on:2016-04-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y S Dong

Full Text:PDF

GTID:2348330536467716

Subject:Computer Science and Technology

Abstract/Summary:

Shared-memory multi-core and multi-level cache architectures are widely used in high-performance computing. Although multi-level caches have proved effective in alleviating the "memory wall", memory access efficiency remains low for scientific programs because of their large number of memory access instructions. In addition, multi-core parallelism places higher demands on memory bandwidth. Hence, reducing the frequency of memory accesses and hiding their latency are the focus of research on memory access optimization.

Stencil computations are an important class of memory-intensive computing kernels used in a variety of application domains, ranging from image and video processing to simulation and computational science in several areas of natural science. Recently, stencil computations have become an optimization target for more and more researchers, in aspects such as parallelism, communication, and load balancing, but memory access optimization still needs further study. This dissertation focuses on memory access optimizations for stencil computations on SMP platforms, including loop tiling, vector permutation, and data prefetching. The main contributions and innovations of this dissertation are as follows:

1. The blocking method of loop tiling is improved, and a parallel algorithm that binds data blocks to OpenMP threads is proposed. The improved blocking method jointly considers multi-core or multi-thread parallelism and the multi-level cache structure of the platform. The new parallel algorithm not only effectively solves the problem of high parallel overhead in the traditional parallel algorithm, but also takes full advantage of data reuse between every two adjacent blocks.

2. Vector permutation is used to reduce redundant memory accesses in vectorized stencil computations, and a vector permutation method based on splicing and shifting is proposed. Given the memory access characteristics of stencil computations, SIMD optimization is feasible, and some data elements are reused across vectors. Hence, vector permutation is used to reduce the number of loads and stores and to improve memory access efficiency. In addition, several vector permutation methods are proposed for stencil computations; in particular, the method based on splicing and shifting effectively decreases the number of vector operations.

3. Data prefetching is used to hide memory access latency in stencil computations. Data prefetching takes full advantage of idle bandwidth to fetch data ahead of use, and significantly hides latency by overlapping memory accesses with computation. Based on an analysis of the hardware and software prefetch mechanisms on the Intel x86-64 platform, software data prefetching is used to optimize stencil computations with both contiguous and non-contiguous memory access patterns. In addition, loop unrolling and loop peeling are used to optimize the software prefetching. The experimental results show that software data prefetching benefits stencil computations with non-contiguous memory access.
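
As an illustration of the splicing-and-shifting idea in contribution 2, the following is a minimal sketch in portable C, not the dissertation's implementation. It emulates a 4-lane vector with a small struct so that it runs without SIMD intrinsics; on a real x86 target, an instruction such as SSSE3 `_mm_alignr_epi8` could play the role of the splice. All names (`vec4`, `splice_shift`, `stencil3_spliced`), the 3-point stencil, and the coefficients are assumptions chosen for the example. The point it shows is that the shifted neighbor vectors are built from vectors already held in registers, so each iteration issues only one new vector load.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4 /* emulated vector length; an assumption for this sketch */

typedef struct { float lane[VL]; } vec4; /* stand-in for a SIMD register */

/* Hypothetical stencil coefficients, chosen only for illustration. */
static const float C0 = 0.25f, C1 = 0.5f, C2 = 0.25f;

static vec4 vload(const float *p) {
    vec4 v;
    for (int k = 0; k < VL; k++) v.lane[k] = p[k];
    return v;
}

static void vstore(float *p, vec4 v) {
    for (int k = 0; k < VL; k++) p[k] = v.lane[k];
}

/* Splice lo and hi into one 2*VL-lane window and return the VL lanes
 * that start at offset `shift` (0 <= shift <= VL). */
static vec4 splice_shift(vec4 lo, vec4 hi, int shift) {
    vec4 r;
    for (int k = 0; k < VL; k++) {
        int s = k + shift;
        r.lane[k] = (s < VL) ? lo.lane[s] : hi.lane[s - VL];
    }
    return r;
}

/* Scalar reference: out[i] = C0*in[i-1] + C1*in[i] + C2*in[i+1]. */
void stencil3_scalar(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = C0 * in[i - 1] + C1 * in[i] + C2 * in[i + 1];
}

/* Vectorized variant: prev/cur/next are aligned blocks; the left and
 * right neighbor vectors come from splicing, not from extra loads. */
void stencil3_spliced(const float *in, float *out, size_t n) {
    assert(n % VL == 0 && n >= 2 * VL);

    /* First block is handled in scalar code (out[0] is a boundary). */
    for (size_t i = 1; i < VL; i++)
        out[i] = C0 * in[i - 1] + C1 * in[i] + C2 * in[i + 1];

    vec4 prev = vload(in);
    vec4 cur = vload(in + VL);
    size_t i = VL;
    for (; i + 2 * VL <= n; i += VL) {
        vec4 next = vload(in + i + VL);              /* the only new load */
        vec4 left = splice_shift(prev, cur, VL - 1); /* in[i-1 .. i+2] */
        vec4 right = splice_shift(cur, next, 1);     /* in[i+1 .. i+4] */
        vec4 res;
        for (int k = 0; k < VL; k++)
            res.lane[k] = C0 * left.lane[k] + C1 * cur.lane[k]
                        + C2 * right.lane[k];
        vstore(out + i, res);
        prev = cur; /* rotate registers so loaded data is reused */
        cur = next;
    }

    /* Last block is handled in scalar code (out[n-1] is a boundary). */
    for (; i + 1 < n; i++)
        out[i] = C0 * in[i - 1] + C1 * in[i] + C2 * in[i + 1];
}
```

A naive vectorization would load three overlapping vectors per output block (at offsets i-1, i, and i+1); in this sketch the two shifted vectors are instead derived from `prev`, `cur`, and `next`, which is the kind of load reduction the abstract attributes to vector permutation.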

Keywords/Search Tags:

Stencil Computation, SMP, Multi-level Cache, Loop Tiling, SIMD, Vector Permutation, Data Prefetch

Related items

1	Research On The Performance Optimizations For Stencil Computations On ARM High-performance Processor
2	Research On Performance Optimizations Of Stencil Computations On Domestic Heterogeneous Many-core Processor
3	Multi-dimension And Multi-level Associated Cache Partitioning Mechanism In CMP
4	Automatic Generation And Performance Optimization Of Code In Stencil Computation
5	Research Of SIMD Vectorization Algorithm And Regrouping Technology
6	Automatic Generation And Optimization Of Data Permutation Instructions For Simd Devices
7	Automatic Generation And Optimization Of Data Permutation Instructions For SIMD Devices
8	Design And Implementation Of A Web Cache And Prefetching System
9	Performance Optimization of Stencil Computations on Modern SIMD Architectures
10	Classification-based Prefetch-Aware Cache Partition Mechanism