| In recent years, owing to the availability of large-scale credible data, tremendous advances in hardware computing power, and innovations in neural network architecture, deep neural networks (DNNs) have achieved unprecedented success and have been widely applied in many fields, such as image classification, speech recognition, object detection, and natural language processing. As a result, DNN training is increasingly becoming a common and heavy workload in high-performance computing data centers. DNN training is a typical computation- and memory-intensive workload: the training process typically involves large data sets, a considerable amount of computation, and long training times. Several emerging hardware accelerators, such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs), provide opportunities for high-performance acceleration of DNN training owing to their tremendous computing power. In particular, the GPU has become the most important processing platform for DNN training due to its large-scale thread-level parallelism and high memory bandwidth. However, GPUs are power-hungry, resulting in high energy consumption for DNN training, which has become the major scalability bottleneck for applying GPUs to this field. Unfortunately, this bottleneck will only worsen as larger and more complex DNN models are developed and deployed. It is therefore imperative to explore new methods for high-performance and energy-efficient DNN training. Improving the performance of DNN training while effectively reducing its energy consumption has become the two major challenges facing DNN training. To cope with these two challenges, this paper leverages the high computing power and high energy efficiency of heterogeneous systems and studies the key technologies for achieving high-performance and energy-efficient DNN training on such systems. The main research work of this paper is summarized as follows.

First, in order to better
understand the performance, power-consumption, and energy-efficiency behavior of DNN training, this paper conducts a comprehensive evaluation and in-depth analysis of DNN training on CPUs and GPUs in terms of performance, power consumption, and energy efficiency. Several popular DNN models are used to conduct a series of experiments on GPUs and CPUs; the training power consumption and energy efficiency on CPUs and GPUs are compared and analyzed, as are the results across different GPUs. In addition, at the level of software factors, this paper studies the impact of different types of neural network layers, batch-size settings, and numerical precision on training performance, power consumption, and energy efficiency. In terms of hardware factors, this paper investigates the impact of several hardware features available on CPUs and GPUs (e.g., Hyper-Threading, Error-Correcting Code memory, and Dynamic Voltage and Frequency Scaling) on the performance, power consumption, and energy efficiency of DNN training. Based on this series of experimental evaluations, multiple important design implications and best-practice principles are summarized to facilitate and guide performance and energy-efficiency optimization of DNN training on heterogeneous systems. Compared with previous work, which mainly focused on the performance of DNN training while ignoring energy efficiency, the evaluation and analysis in this paper are more comprehensive, covering power-consumption and energy-efficiency behavior as well.

Second, the high power consumption of GPUs is a major cause of the high energy consumption of DNN training. To address this problem, this paper explores how to achieve energy-efficient DNN training on GPU-FPGA heterogeneous systems without performance loss (in either training throughput or model accuracy) by taking
advantage of the high performance of GPUs and the low power consumption of FPGAs. To this end, this paper proposes a hybrid performance-aware energy-efficient training framework named Hype-training, which combines offline characterization, performance modeling, and runtime scheduling. This paper first designs a DNN profiling framework, analyzes the power-consumption and performance characteristics of DNN operations on the GPU and the FPGA respectively, and, based on the analysis results, identifies the DNN operations potentially suitable for running on the FPGA. Then, the corresponding FPGA kernels are optimized for these candidate operations. To better optimize the FPGA kernels and reduce the optimization time, a lightweight performance model is proposed to quickly select the optimal parameter configuration from the parameter-optimization space. Moreover, a series of other effective optimization techniques is applied to the FPGA kernels, including loop tiling, double buffering, and input-adaptive kernel execution. Finally, for runtime scheduling, this paper proposes two fine-grained scheduling strategies to meet the different performance, power-consumption, and energy-efficiency goals of two common data-center use scenarios. Experiments using NVIDIA V100 GPUs and Intel Stratix 10 FPGAs show that Hype-training can exploit a mixture of GPUs and FPGAs at fine granularity to lower the energy consumption of DNN training significantly, by 44.3% on average and up to 59.7%, without any performance loss. In addition, Hype-training can enforce power capping more effectively than state-of-the-art power-management mechanisms on GPUs.

Third, starting with the Volta microarchitecture, NVIDIA introduced Tensor Cores into its GPUs specifically for FP16 arithmetic. On GPUs featuring Tensor Cores, adopting mixed-precision training can significantly improve the performance and energy efficiency of DNN
training. However, the current mixed-precision optimizer adopts a greedy graph-rewriting algorithm that executes all performance-critical operations in FP16 without weighing the introduced casting overhead against the performance gain of executing an operation in FP16 (compared with FP32), which usually results in suboptimal training throughput. To address this performance problem, this paper proposes a comprehensive and efficient mixed-precision optimizer named Campo, which explicitly takes the casting cost into account when rewriting the DNN dataflow graph; it assigns the optimal execution precision to each operation and minimizes unnecessary casts to maximize performance. Campo leverages a lightweight yet accurate performance model to predict the impact of the associated casting overhead on the choice of execution mode for an operation at a given input size. Aided by this performance model, a casting-cost-aware graph-rewriting algorithm assigns the optimal execution precision to each operation at runtime. This paper evaluates Campo with six mainstream DNN models on both an NVIDIA GeForce RTX 2080 Ti GPU (Turing microarchitecture) and an NVIDIA Tesla V100 GPU (Volta microarchitecture). The experimental results show that Campo does not adversely affect model training accuracy; compared with the existing approach, the training throughput on the RTX 2080 Ti is improved by 20.8% on average (up to 24.5%), the training throughput on the V100 is improved by 20.4% on average (up to 23.1%), and energy efficiency is also improved. |
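The core idea behind casting-cost-aware precision assignment can be illustrated with a minimal sketch. This is not Campo's actual implementation; the function name, its signature, and the timing numbers below are hypothetical, and a real optimizer would obtain the per-operation times from a performance model and account for whether neighboring operations already run in the target precision.

```python
def choose_precision(t_fp32, t_fp16, cast_in_cost, cast_out_cost):
    """Pick FP16 for an operation only when its speedup outweighs
    the casts it introduces (all arguments are predicted times).

    t_fp32 / t_fp16: execution time of the op in each precision.
    cast_in_cost / cast_out_cost: time to cast inputs/outputs;
    zero when adjacent ops already produce/consume the target precision.
    """
    fp16_total = t_fp16 + cast_in_cost + cast_out_cost
    return "fp16" if fp16_total < t_fp32 else "fp32"

# Hypothetical numbers: a large matmul where FP16 halves compute time,
# so it pays off even with both input and output casts.
print(choose_precision(t_fp32=10.0, t_fp16=5.0,
                       cast_in_cost=2.0, cast_out_cost=1.5))  # fp16

# A small op: FP16 saves little and the casts dominate, so keep FP32 --
# this is exactly the case a greedy "always FP16" rewriter gets wrong.
print(choose_precision(t_fp32=2.0, t_fp16=1.6,
                       cast_in_cost=0.5, cast_out_cost=0.4))  # fp32
```

In a dataflow-graph rewriter, this comparison would be applied per operation during graph traversal, with the cast terms dropping to zero whenever a neighboring operation has already been assigned the same precision, which is what drives the minimization of unnecessary casts.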