Deep learning algorithms, as representatives of artificial intelligence (AI) algorithms, have developed rapidly in recent years. In research fields such as computer vision, natural language processing, and competitive games, the intelligence demonstrated by deep learning algorithms has approached or even exceeded the level of human beings. Traditional computing systems based on general-purpose processors fail to meet the demands of deep learning algorithms. Emerging AI computing systems have become an important physical platform for deep learning algorithms due to their excellent performance and energy efficiency. However, the rapid development of deep learning algorithms and the continuous innovation of AI chips pose a great challenge to the versatility of AI computing systems. This dissertation investigates different levels of the system software in AI computing systems, takes the tensor operations in deep learning algorithms and AI chips as the typical features, and explores cross-platform compilation technology for AI computing systems. The major contributions and innovations include:

(1) To address cross-platform code generation, we propose a unified abstract machine model for AI computing systems. The key observation is that highly diverse AI chips share several key architectural characteristics for tensor operations, which can be generalized to conduct cross-platform code generation. Therefore, we propose a tensor abstract machine for AI chips. Based on the tensor abstract machine, we further propose a tensor abstract instruction set and design a tensor intermediate representation that includes scalar operations, vector operations, and tensor operations. Moreover, we design a two-stage code generation algorithm to complete code generation from the unified tensor intermediate representation to different AI chips. Experimental results show that the tensor abstract machine can be instantiated to four typical AI chips: DianNao, TPU, VTA, and GPU Tensor Core. Compared with the traditional scalar intermediate representation, our tensor intermediate representation improves programming efficiency by an average of 2.68× and enhances the versatility of the code generation module.

(2) To address cross-platform compilation optimization, we propose one-size-fits-all tensor compilation optimization for AI computing systems. Based on the common operational characteristics of different tensor operations in AI computing systems, we propose a basic paradigm of tensor operations and design a tensor computation description. Then, we introduce five basic tensor schedule primitives derived from common tensor compilation optimizations: tensor decomposition, special convolution optimization, and pipeline optimization. To validate our proposal, we conduct experiments on three commodity AI chips: GPU with Tensor Cores, VTA, and TPU. Experimental results demonstrate that the code generated from the same optimization schedule achieves 1.05× to 2.05× better performance than hand-tuned libraries and deep learning compilers across different platforms.

(3) To address cross-platform porting of user programs, we propose a source-to-source neural compiler, Codeformer, for AI computing systems. Codeformer consists of three parts: pre-training, back-translation, and discriminative reranking, which together enable automatic program translation for AI computing systems from existing user programs. We take the C language as the representative of the serial programming model in traditional computing systems and the CUDA language as the representative of the parallel programming model in AI computing systems to validate our proposal. We create the first large-scale C-to-CUDA dataset for learning to translate from C to CUDA. Experimental results show that, compared with traditional auto-parallelization methods and statistics-based machine translation methods, Codeformer significantly improves machine translation metrics (i.e., BLEU, CodeBLEU, and ParaBLEU) and the compilation accuracy of the generated programs. Furthermore, the CUDA code generated by Codeformer attains a speedup of up to 347× over the sequential C code, and developer productivity is improved by at most 3.8×. Codeformer introduces a machine-learning-based automatic parallelization and program translation method, which provides a good starting point for future work on the topic.
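To make the idea of a tensor schedule primitive concrete, the following is a minimal, purely illustrative Python sketch of what a "tensor decomposition" (tiling) transformation does to a loop nest: the same matrix-multiplication computation is restructured into fixed-size blocks that a tensor instruction set could execute as single operations. All names here (`matmul_naive`, `matmul_decomposed`, `tile`) are hypothetical and do not reflect the dissertation's actual API; the sketch only demonstrates that the transformation preserves the computation while exposing hardware-friendly block structure.

```python
def matmul_naive(A, B, M, N, K):
    """Reference tensor computation: C[i][j] = sum_k A[i][k] * B[k][j]."""
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_decomposed(A, B, M, N, K, tile=2):
    """The same computation after a tiling ("tensor decomposition") schedule:
    the loop nest is split into tile-sized blocks, each of which could be
    mapped to one tensor instruction on an AI chip."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # This inner block is the unit a tensor ISA would execute.
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        for k in range(k0, min(k0 + tile, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Because the schedule only reorders the iteration space, both versions produce identical results; a compiler can therefore apply such primitives uniformly and leave the platform-specific mapping of each block to the backend.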