
Study Of ParaC Compiler Supporting Deep Learning Operator Parallel Algorithm Optimization

Posted on: 2021-03-13  Degree: Master  Type: Thesis
Country: China  Candidate: H Yin  Full Text: PDF
GTID: 2518306032967039  Subject: Computer technology
Abstract/Summary:
In recent years, the wide application of deep learning in many fields has brought tremendous progress to daily life and production. However, deep learning rests on huge training datasets and heavy consumption of computing resources, and since algorithms are the core of deep learning, accelerating deep learning algorithms is particularly important. With the advancement of large-scale parallel GPU architectures by companies such as NVIDIA, GPUs have become the mainstream acceleration platform for deep learning, but because GPU architectures are complex, program optimization on GPU platforms faces the challenges of being difficult, complicated, and inefficient. Academia and industry have developed deep learning compilation and optimization frameworks that support typical operators, target many mainstream hardware platforms, and provide parallel optimization and automatic tuning, but the existing tools still leave many problems unsolved.

Based on the ParaC compiler, this thesis implements a compiler that optimizes deep learning algorithms in parallel and generates high-performance CUDA code for the GPU platform. The ParaC language is extended to support complex nested loop structures and matrix types of more than two dimensions; the compiler's parallel analysis and optimization capabilities are improved, and a CUDA code-generation back end is provided; an open optimization interface (OPI) is designed and implemented, offering OPI guidance methods that support explicit optimization strategies; and OPI guidance and runtime functions for data-flow optimization are provided, supporting data-flow optimization in mixed-language programming.

Two typical algorithms are selected to evaluate the performance of the ParaC compiler. For each algorithm, a ParaC version is written first; a CUDA version is then produced through OPI-guided tuning and compilation; finally, the performance of the generated code is compared with that of hand-optimized versions and high-performance algorithm libraries. Overall, the performance of the ParaC versions matches that of the hand-optimized versions. For bucket sorting, with array sizes from 10,000 to 200,000, performance exceeds that of the Thrust library. For batch normalization, taking the ResNet-50 network as an example at a batch size of 128, performance on large images is higher than that of the cuDNN library, while performance on small images is slightly lower, although better parallel optimization strategies can still be found. In terms of productivity, the ParaC version requires far fewer lines of code than the CUDA version, and because OPI guidance provides a tuning interface, the developers' tuning effort is greatly reduced.
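For reference, the batch normalization operator compared against cuDNN above normalizes each value by the mean and variance computed over the batch. A minimal scalar sketch in Python (the function name and 1-D treatment are illustrative only, independent of the ParaC or CUDA implementations):

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over a 1-D batch of values:
    y = gamma * (x - mean) / sqrt(var + eps) + beta."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in values]
```

On a GPU, the mean and variance computations become parallel reductions and the final line becomes an element-wise kernel; these are exactly the loop nests a compiler such as ParaC must map to CUDA threads.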
Keywords/Search Tags: Deep learning operator, compiler, CUDA, OPI guidance, data flow, optimization