Optimizations Of Scientific Kernels On SW26010 Many-core Processor

Posted on: 2019-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Xu
Full Text: PDF
GTID: 2428330590967390
Subject: Computer Science and Technology
Abstract/Summary:
Sunway TaihuLight, China's domestically designed and developed supercomputer, claims the top place in the latest TOP500 list with the home-grown SW26010 many-core processor. In a radical departure from the traditional architectural design of CPUs and GPGPUs, the SW26010 many-core processor reaches a peak performance of 3.06 TFlops, while its theoretical peak memory bandwidth is only 136 GB/s, resulting in an imbalanced flops-per-byte ratio. To bring out the potential of SW26010 for scientific computing, our research focused on comprehensive optimizations for both compute-bound and memory-bound scientific kernels on the SW26010 architecture. First, we developed a micro-benchmark suite to evaluate the architectural characteristics of the pipelines, the memory hierarchy, and the on-chip register-level communication mechanism. Second, to optimize general matrix multiplication (GEMM), a compute-bound kernel, we implemented a multi-level parallel algorithm based on register-level communication, tuned the assembly code of the compute kernel, and achieved over 90% of the peak performance for both DGEMM and SGEMM. Third, to optimize the memory-bound stencil kernel, we combined spatial and temporal blocking with a data exchange scheme based on register-level communication to improve the arithmetic intensity, double-buffered the data to hide memory access latency, and achieved over 70% of the optimal performance. Based on our results, we found that: (1) For memory-bound scientific kernels, and for compute-bound kernels with an arithmetic intensity below 33.84, data sharing schemes and parallel algorithms based on register-level communication are needed to overcome the weakness in memory access. (2) For compute-intensive scientific kernels, the key to improving pipeline efficiency is tuning the assembly code, including rescheduling instructions and improving the dual-issue rate.
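The flops-per-byte imbalance mentioned in the abstract can be illustrated with a simple roofline-style calculation. The sketch below is not taken from the thesis: it only uses the two theoretical figures quoted above (3.06 TFlops and 136 GB/s), and the function names are hypothetical. With these theoretical numbers the balance point comes out at about 22.5 flops per byte; the threshold of 33.84 reported in the abstract is the author's own figure and is not derived here.

```python
# Roofline-style sketch using only the theoretical figures quoted in the abstract.
# Function names are illustrative assumptions, not part of the thesis.

PEAK_FLOPS = 3.06e12  # peak performance, flop/s (from the abstract)
PEAK_BW = 136e9       # theoretical peak memory bandwidth, byte/s (from the abstract)


def machine_balance(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Flops the processor can execute per byte moved from main memory."""
    return peak_flops / peak_bw


def attainable_flops(arithmetic_intensity, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Upper bound on performance for a kernel with the given flops-per-byte ratio."""
    return min(peak_flops, arithmetic_intensity * peak_bw)


def is_memory_bound(arithmetic_intensity, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """True when memory bandwidth, not the pipelines, caps the kernel."""
    return arithmetic_intensity < machine_balance(peak_flops, peak_bw)


if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.1f} flops/byte")
    for ai in (0.5, 8.0, 64.0):  # hypothetical arithmetic intensities
        bound = "memory" if is_memory_bound(ai) else "compute"
        print(f"AI={ai:5.1f} -> {attainable_flops(ai) / 1e12:.3f} TFlops ({bound}-bound)")
```

Under this model, a kernel with low arithmetic intensity, such as a stencil, is capped by bandwidth, which is why techniques that raise the effective flops-per-byte ratio (blocking, on-chip data sharing) matter on this architecture.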
Keywords/Search Tags: Sunway TaihuLight, SW26010 many-core processor, GEMM, Stencil, Register-level communication, Assembly code optimization, Memory access optimization