
Optimization Of Convolution Based On CPU SIMD Instruction Set

Posted on: 2022-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhuang
Full Text: PDF
GTID: 2518306773471204
Subject: Automation Technology

Abstract/Summary:
In recent years, with the rapid development of deep learning, forward inference of deep learning models has increasingly been deployed on diverse devices. Because the hardware characteristics of these devices differ substantially, a common deep learning model's forward inference cannot maintain good performance on all of them at once; that is, forward inference has poor performance portability across hardware devices. In addition, for the convolutional layers that dominate inference time, the input matrices of different convolutional layers have different shapes, and some of them are skinny tall matrices, so the optimizations in common deep learning libraries do not suit all input shapes. These libraries therefore cannot achieve excellent performance on every convolutional layer; in other words, their optimized convolution implementations have poor performance portability across convolution input shapes.

We present FastConv, a template-based code auto-generation open-source library that automatically generates high-performance deep learning convolution kernels for matrices and tensors of arbitrary shapes. FastConv is built on the Winograd fast convolution algorithm and on convolution via the im2col transform followed by General Matrix Multiplication (GEMM). ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs, which raises the question of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants whose shapes suit skinny tall matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning searches this parameter configuration space for the best performance on a given target architecture and problem size.

Results show speedups of 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x over NNPACK, ARM NN, and FeatherCNN on Kunpeng 920. Furthermore, performance portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over NNPACK and the ARM NN inference engine using Winograd on Kunpeng 920. CPU performance portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on Kunpeng 920, Snapdragon 835, 855, 888, Apple M1, and AWS Graviton2, respectively. Taken together, these results show that FastConv delivers good performance portability across different convolution input shapes and different ARM CPUs.
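To make the im2col + GEMM path described above concrete, the following is a minimal scalar C++ sketch of that convolution scheme for a single NCHW image with stride 1 and no padding. It is an illustrative reference only, not FastConv's code: the function names (im2col, gemm), the data layout, and the toy problem sizes are assumptions made for this example, and the naive GEMM stands in for the SIMD-tiled, auto-tuned kernels the library would generate.

// Minimal im2col + GEMM convolution reference (single image, NCHW,
// stride 1, no padding). Illustrative sketch only; not FastConv code.
#include <cstdio>
#include <vector>

// Unfold each KxK input patch into one column of a (Cin*K*K) x (Hout*Wout) matrix.
static void im2col(const std::vector<float>& in, int Cin, int H, int W, int K,
                   std::vector<float>& cols, int Hout, int Wout) {
    for (int c = 0; c < Cin; ++c)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (c * K + kh) * K + kw;
                for (int oh = 0; oh < Hout; ++oh)
                    for (int ow = 0; ow < Wout; ++ow)
                        cols[row * (Hout * Wout) + oh * Wout + ow] =
                            in[(c * H + oh + kh) * W + ow + kw];
            }
}

// Naive GEMM: C(MxN) = A(MxK) * B(KxN). A tuned library would replace this
// with a SIMD-tiled kernel; here it only shows the data flow.
static void gemm(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

int main() {
    // Toy problem: 1 input channel, 4x4 image, 2 output channels, 3x3 filters.
    const int Cin = 1, H = 4, W = 4, Cout = 2, K = 3;
    const int Hout = H - K + 1, Wout = W - K + 1;
    std::vector<float> input(Cin * H * W), weights(Cout * Cin * K * K);
    for (int i = 0; i < Cin * H * W; ++i) input[i] = float(i);
    for (int i = 0; i < Cout * Cin * K * K; ++i) weights[i] = (i % 3 == 0) ? 1.f : 0.f;

    std::vector<float> cols(Cin * K * K * Hout * Wout);
    std::vector<float> output(Cout * Hout * Wout);
    im2col(input, Cin, H, W, K, cols, Hout, Wout);
    // The weights, viewed as a (Cout) x (Cin*K*K) matrix, multiply the column matrix.
    gemm(weights.data(), cols.data(), output.data(), Cout, Hout * Wout, Cin * K * K);

    for (int co = 0; co < Cout; ++co)
        for (int p = 0; p < Hout * Wout; ++p)
            std::printf("out[%d][%d] = %.1f\n", co, p, output[co * Hout * Wout + p]);
    return 0;
}

The sketch makes the key trade-off visible: im2col replicates input elements so that convolution becomes one large matrix multiplication, which is why the shape of that matrix (often skinny and tall for small feature maps) determines which GEMM kernel variant and tiling the auto-tuner should select.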
Keywords/Search Tags:Convolution, Deep Learning, Parallel Computing, Auto Tuning