
Optimization Of Convolution Based On CPU SIMD Instruction Set

Posted on: 2022-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhuang
Full Text: PDF
GTID: 2518306773471204
Subject: Automation Technology

Abstract/Summary:
In recent years, with the rapid development of deep learning, forward inference of deep learning models has increasingly been deployed on diverse devices. Because the hardware characteristics of these devices differ substantially, a common deep learning model's forward inference cannot maintain good performance on all of them at once; that is, forward inference has poor performance portability across hardware devices. In addition, for the convolutional layers that dominate inference time, the input matrices of different convolutional layers have different shapes, and some of them are skinny tall matrices, so the optimizations in common deep learning libraries do not suit all input shapes. These libraries therefore cannot achieve excellent performance on every convolutional layer; in other words, their optimized convolution implementations have poor performance portability across convolution input shapes.

We present FastConv, a template-based code auto-generation open-source library that automatically generates high-performance deep learning convolution kernels for matrices and tensors of arbitrary shapes. FastConv is built on the Winograd fast convolution algorithm and on convolution via the im2col transform followed by General Matrix Multiplication (GEMM). ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs, which raises the question of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants whose shapes suit skinny tall matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning searches this parameter configuration space for the best performance on a given target architecture and problem size.

Results show speedups of 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x over NNPACK, ARM NN, and FeatherCNN on Kunpeng 920. Furthermore, performance portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over NNPACK and the ARM NN inference engine using Winograd on Kunpeng 920. CPU performance portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on Kunpeng 920, Snapdragon 835, 855, 888, Apple M1, and AWS Graviton2, respectively. Taken together, these results show that FastConv delivers good performance portability across different convolution input shapes and different ARM CPUs.
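To make the im2col + GEMM path described above concrete, the following is a minimal scalar C++ sketch of that convolution scheme for a single NCHW image with stride 1 and no padding. It is an illustrative reference only, not FastConv's code: the function names (im2col, gemm), the data layout, and the toy problem sizes are assumptions made for this example, and the naive GEMM stands in for the SIMD-tiled, auto-tuned kernels the library would generate.

// Minimal im2col + GEMM convolution reference (single image, NCHW,
// stride 1, no padding). Illustrative sketch only; not FastConv code.
#include <cstdio>
#include <vector>

// Unfold each KxK input patch into one column of a (Cin*K*K) x (Hout*Wout) matrix.
static void im2col(const std::vector<float>& in, int Cin, int H, int W, int K,
                   std::vector<float>& cols, int Hout, int Wout) {
    for (int c = 0; c < Cin; ++c)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (c * K + kh) * K + kw;
                for (int oh = 0; oh < Hout; ++oh)
                    for (int ow = 0; ow < Wout; ++ow)
                        cols[row * (Hout * Wout) + oh * Wout + ow] =
                            in[(c * H + oh + kh) * W + ow + kw];
            }
}

// Naive GEMM: C(MxN) = A(MxK) * B(KxN). A tuned library would replace this
// with a SIMD-tiled kernel; here it only shows the data flow.
static void gemm(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

int main() {
    // Toy problem: 1 input channel, 4x4 image, 2 output channels, 3x3 filters.
    const int Cin = 1, H = 4, W = 4, Cout = 2, K = 3;
    const int Hout = H - K + 1, Wout = W - K + 1;
    std::vector<float> input(Cin * H * W), weights(Cout * Cin * K * K);
    for (int i = 0; i < Cin * H * W; ++i) input[i] = float(i);
    for (int i = 0; i < Cout * Cin * K * K; ++i) weights[i] = (i % 3 == 0) ? 1.f : 0.f;

    std::vector<float> cols(Cin * K * K * Hout * Wout);
    std::vector<float> output(Cout * Hout * Wout);
    im2col(input, Cin, H, W, K, cols, Hout, Wout);
    // The weights, viewed as a (Cout) x (Cin*K*K) matrix, multiply the column matrix.
    gemm(weights.data(), cols.data(), output.data(), Cout, Hout * Wout, Cin * K * K);

    for (int co = 0; co < Cout; ++co)
        for (int p = 0; p < Hout * Wout; ++p)
            std::printf("out[%d][%d] = %.1f\n", co, p, output[co * Hout * Wout + p]);
    return 0;
}

The sketch makes the key trade-off visible: im2col replicates input elements so that convolution becomes one large matrix multiplication, which is why the shape of that matrix (often skinny and tall for small feature maps) determines which GEMM kernel variant and tiling the auto-tuner should select.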
Keywords/Search Tags:Convolution, Deep Learning, Parallel Computing, Auto Tuning