Optimizations Of Scientific Kernels On SW26010 Many-core Processor

Posted on: 2019-08-15

Degree: Master

Type: Thesis

Country: China

Candidate: Z G Xu

Full Text: PDF

GTID: 2428330590967390

Subject: Computer Science and Technology

Abstract/Summary:

Sunway TaihuLight, China's domestically designed and developed supercomputer, claims the top place in the latest TOP500 list with the home-grown SW26010 many-core processor. In a radical departure from the traditional architectural design of CPUs and GPGPUs, the SW26010 many-core processor reaches a peak performance of 3.06 TFlops, while its theoretical peak memory bandwidth is only 136 GB/s, resulting in an imbalanced flops-per-byte ratio. To bring out the potential of the SW26010 for scientific computing, our research focused on comprehensive optimizations for both compute-bound and memory-bound scientific kernels on the SW26010 architecture. First, we developed a micro-benchmark suite to evaluate the architectural characteristics of the pipelines, the memory hierarchy, and the on-chip register-level communication mechanism. Second, to optimize general matrix multiplication (GEMM), a compute-bound kernel, we implemented a multi-level parallel algorithm based on register-level communication, tuned the assembly code of the kernel, and achieved over 90% of the peak performance for both DGEMM and SGEMM. Third, to optimize the memory-bound stencil kernel, we combined spatial and temporal blocking with a data exchange scheme based on register-level communication to improve the arithmetic intensity, double-buffered the data to hide memory access latency, and achieved over 70% of the optimal performance. Based on our results, we found that: (1) For memory-bound scientific kernels, and for compute-bound kernels with an arithmetic intensity below 33.84, data sharing schemes and parallel algorithms based on register-level communication are needed to overcome the weakness of memory access. (2) For compute-intensive scientific kernels, the key to improving pipeline efficiency is tuning the assembly code, including rescheduling instructions and improving the dual-issue rate.
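The imbalance described in the abstract can be made concrete with a simple roofline-style calculation. The following is a minimal sketch, not the thesis's own model: it uses only the two figures quoted above (3.06 TFlops and 136 GB/s), the function names are hypothetical, and the sample arithmetic intensities are illustrative values rather than measurements. It does not attempt to reproduce the 33.84 threshold, whose derivation is not given in the abstract.

```python
# Roofline-style sketch using the two figures quoted in the abstract.
# Function names and sample intensities are hypothetical, for illustration only.

PEAK_FLOPS = 3.06e12     # peak performance in flop/s (from the abstract)
PEAK_BANDWIDTH = 136e9   # theoretical peak memory bandwidth in byte/s (from the abstract)


def machine_balance(peak_flops=PEAK_FLOPS, bandwidth=PEAK_BANDWIDTH):
    """Flops the processor can execute per byte moved from main memory."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=PEAK_BANDWIDTH):
    """Roofline bound: a kernel is limited by compute or by memory traffic,
    whichever ceiling is lower at the given arithmetic intensity (flop/byte)."""
    return min(peak_flops, intensity * bandwidth)


if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.1f} flop/byte")
    # Hypothetical arithmetic intensities, from stencil-like to GEMM-like.
    for intensity in (0.5, 4.0, 16.0, 64.0):
        bound = attainable_flops(intensity)
        share = bound / PEAK_FLOPS
        print(f"intensity {intensity:5.1f} flop/byte -> "
              f"bound {bound / 1e12:.2f} TFlops ({share:.0%} of peak)")
```

Under this simplified model, any kernel whose arithmetic intensity falls below the machine balance is bounded by memory bandwidth, which is why raising the intensity through blocking and on-chip data sharing matters for stencil-like kernels.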

Keywords/Search Tags:

Sunway TaihuLight, SW26010 many-core processor, GEMM, Stencil, Register-level communication, Assembly code optimization, Memory access optimization

Related items

1	Porting And Optimization Of OpenFOAM On The Sunway Taihulight Supercomputer
2	The Design And Optimization Of High-performance Molecular Dynamics Algorithms On The Sunway TaihuLight Supercomputer
3	Porting And Optimizing GTC-P Code On Sunway TaihuLight Supercomputer
4	Research On Directive-based Parallel Language For Sunway Taihulight Supercomputer And Design Of The Compiling Optimization
5	Implementing Molecular Dynamics Simulation On The Sunway TaihuLight System With Heterogeneous Many-Core Processors
6	Parallel Implementation And Performance Optimization For Refactoring GROMACS On The Sunway Many-core Architecture
7	Study Of The Parallel Task Graph Scheduling Optimization On The Sunway Taihulight
8	Parallel Implementation And Performance Optimization For FHI-aims On The Sunway Many-core Architecture
9	Implementation And Optimization Of HPCG On Multi-core And Many-core Platform
10	Optimizations Of Memory-access For Stencil Computations On Shared-memory Multi-core Processor