Recently, deep neural networks (DNNs) have developed rapidly and are widely used in numerous areas. To handle more complex challenges and achieve higher model accuracy, the scale of DNNs is growing explosively. As a result, DNNs consume more and more computing and storage resources, which puts great pressure on hardware. To deploy DNNs on resource-restricted devices, it is necessary to simplify and accelerate these models. DNNs consist mainly of linear and nonlinear layers; this paper achieves model compression and performance improvement by compressing the linear layers and accelerating the nonlinear layers. Linear layers contain most of a model's parameters, so quantizing them is an effective way to simplify the model. The parameters of linear layers obey a Gaussian distribution. However, mainstream methods quantize linear layers to low bit-width fixed-point numbers, which may cause large errors. Besides, fixed-point numbers are unsuitable for layers with nonlinear computations such as transcendental functions. There are few studies on nonlinear layers at present, yet in recent Transformer models the computation cost of nonlinear layers cannot be ignored. To solve these problems, this paper compresses the linear layers and accelerates the nonlinear layers of DNNs respectively.

To quantize linear layers more precisely, this paper proposes a new 8-bit floating-point format: QFP8. Compared with existing 8-bit formats, it can represent values that obey a Gaussian distribution more precisely. Since data distributions change dramatically among layers, the QFP8 format uses a dynamic bias to adjust the range of data representation, which reduces quantization error. Besides, the statistics of BatchNorm layers in quantized models are biased, and calibrating these statistics can improve inference accuracy. Experiments show that, compared with other 8-bit formats, QFP8 achieves higher inference accuracy on both post-training quantization and quantization-aware training tasks, and calibration of the BatchNorm statistics can further improve quantized model accuracy.

To accelerate nonlinear layers, this paper proposes a method based on multi-segment interpolation fitting. The method guarantees a controllable fitting error, so it can satisfy the accuracy demands of different tasks. Moreover, its computational complexity is merely O(1), so nonlinear layers can be accelerated effectively in both inference and training. It requires only basic hardware instruction support and can be employed on servers and edge devices. Experimental results show that fitting various nonlinear layers with this interpolation method achieves effective acceleration on both CPU and GPU. The interpolated nonlinear layers can be applied conveniently to various models, and inference and training achieve the promised accuracy with a degradation of less than 0.5%. Combined with quantization of the linear layers in the QFP8 format, it can further accelerate inference while preserving model accuracy.
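To make the dynamic-bias idea concrete, the following is a minimal sketch of an 8-bit floating-point quantizer whose exponent bias is chosen per tensor to minimize quantization error. The field split (1 sign, 4 exponent, 3 mantissa bits), the bias search range, and the error criterion are illustrative assumptions, not the actual QFP8 definition.

```python
import numpy as np

EXP_BITS, MAN_BITS = 4, 3  # assumed field split; QFP8's actual split may differ

def quantize_fp8(x, bias):
    """Round each value to the nearest number representable with MAN_BITS
    mantissa bits and exponents in [1 - bias, 2**EXP_BITS - 2 - bias]."""
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    # exponent of each value, clipped to the representable range
    exp = np.floor(np.log2(np.maximum(mag, 1e-38)))
    exp = np.clip(exp, 1 - bias, 2**EXP_BITS - 2 - bias)
    scale = 2.0 ** (exp - MAN_BITS)        # grid spacing at this exponent
    q = np.round(mag / scale) * scale      # round mantissa to MAN_BITS bits
    # clamp to the largest finite value of the format
    max_val = (2 - 2.0 ** -MAN_BITS) * 2.0 ** (2**EXP_BITS - 2 - bias)
    return sign * np.minimum(q, max_val)

def choose_bias(x):
    """Dynamic bias: pick the exponent bias that minimizes the mean
    squared quantization error for this particular tensor."""
    errs = {b: np.mean((x - quantize_fp8(x, b)) ** 2) for b in range(16)}
    return min(errs, key=errs.get)

w = np.random.randn(1024) * 0.05  # Gaussian-distributed weights
b = choose_bias(w)
wq = quantize_fp8(w, b)
```

Because the weights of different layers have very different scales, re-running `choose_bias` per layer shifts the representable range to where the Gaussian mass actually lies, which is the effect the dynamic bias is meant to capture.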
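The multi-segment interpolation scheme can likewise be sketched in a few lines. With uniform segments, locating the segment for an input is a single multiply and truncation, so each evaluation is O(1) regardless of the segment count, and the fitting error is controlled by choosing more segments. The segment count, input range, and clipping policy below are illustrative choices, and sigmoid stands in for whatever nonlinear layer is being fitted.

```python
import numpy as np

def build_table(fn, lo, hi, segments):
    """Precompute fn at the endpoints of uniform segments over [lo, hi]."""
    xs = np.linspace(lo, hi, segments + 1)
    return xs, fn(xs)

def interp_eval(x, xs, ys):
    """Evaluate fn(x) by linear interpolation within its segment.
    Uniform segments make the index computation O(1) per element."""
    lo, step = xs[0], xs[1] - xs[0]
    xc = np.clip(x, lo, xs[-1])               # saturate outside the fitted range
    i = np.minimum(((xc - lo) / step).astype(int), len(xs) - 2)
    t = (xc - xs[i]) / step                   # position inside the segment
    return ys[i] * (1 - t) + ys[i + 1] * t

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
xs, ys = build_table(sigmoid, -8.0, 8.0, 256)
x = np.random.randn(4096)
y = interp_eval(x, xs, ys)
```

For linear interpolation the worst-case error on a segment of width h is bounded by max|f''| * h^2 / 8, so doubling the segment count cuts the error bound by a factor of four; this is the sense in which the fitting error is controllable.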