
Mixed-precision Quantization Methods For Convolutional Neural Network Compression

Posted on: 2021-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Bao
Full Text: PDF
GTID: 2518306503480274
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the successful application of deep convolutional neural networks (CNNs) in various computer vision tasks, researchers design deeper or wider networks to surpass existing classic methods and achieve better performance. Most state-of-the-art convolutional networks need tens of megabytes of weight storage and billions of floating-point operations to perform a single forward inference, which makes them difficult to deploy widely on resource-constrained edge devices. Quantization is considered one of the most effective ways to meet the memory requirements of edge devices: it reduces model size by replacing the 32-bit floating-point numbers in weights, activations, and gradients with lower bit-width representations. However, most quantization methods assign a uniform bit-width to all network layers. When a deep neural network is compressed to very low precision, a few sensitive layers may severely reduce its accuracy. A better strategy is therefore to adopt a heterogeneous bit-width allocation scheme, a research topic known as mixed-precision quantization.

Existing work on mixed-precision quantization suffers from several drawbacks, such as high complexity and uncertainty in the bit-width allocation. Our research starts from an analysis of local quantization noise and connects layer importance with a dynamic notion of quantization sensitivity. Moreover, based on the premise that quantization noise is equivalent to a small perturbation near a local equilibrium point, our method assigns bit-widths by successively decreasing the representation precision of individual layers, so that the final bit-width allocation scheme is unique.

We prove that feature maps amplify small quantization perturbations of the weights, and that the degradation of network accuracy is directly caused by the resulting layer-to-layer differences in feature maps. We therefore propose that layer-wise quantization should aim to reconstruct the feature map, adjusting the quantization centroids produced by a traditional quantizer. We derive approximate estimates of the quantized feature-map error and optimize them iteratively with the alternating direction method of multipliers (ADMM). Building on this single-layer noise analysis, we propose a quantization sensitivity measure under small perturbations: the lower a layer's quantization sensitivity, the higher its quantization priority. The overall weight bit-width allocation algorithm is a stepwise precision reduction guided by this "feature map alignment" criterion; once the target compression ratio is reached, the allocation process stops, and the scheme for any lower compression ratio can be derived from the recorded history log. Given the characteristic function of the quantization error, we further propose a framework for allocating activation bit-widths under the constraint of the chosen weight precision. Experiments on mainstream neural networks show that our method achieves better results than related works.
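To make the storage argument concrete, the following minimal Python sketch shows how replacing 32-bit floating-point weights with a b-bit grid trades precision for size. It assumes a symmetric uniform quantizer with one scale per tensor; the function name and the toy tensor are illustrative, not the thesis's implementation.

import numpy as np

def quantize_uniform(w, bits):
    """Fake-quantize float32 weights onto a symmetric b-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # dequantized weights

w = np.random.randn(64, 64).astype(np.float32)
for b in (8, 4, 2):
    err = np.abs(w - quantize_uniform(w, b)).mean()
    print(f"{b}-bit storage = {b / 32:.1%} of float32, mean |error| = {err:.4f}")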
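The feature-map reconstruction idea can be sketched as follows. In place of the ADMM optimization used in the thesis, this toy version adjusts the quantization centroids by plain gradient descent on the layer-output error ||X(W - W_q)||^2, with the nearest-centroid assignment held fixed; the shapes, learning rate, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))              # calibration activations
W = rng.standard_normal((32, 16))               # full-precision weights

K = 4                                           # 2-bit -> 4 centroids
c = np.quantile(W, np.linspace(0.1, 0.9, K))    # initial centroid values
assign = np.abs(W[..., None] - c).argmin(-1)    # nearest-centroid index per weight

lr = 1e-4
for _ in range(300):
    Wq = c[assign]                              # current quantized weights
    # gradient of ||X @ (W - Wq)||_F^2 w.r.t. Wq, averaged per centroid
    G = 2.0 * X.T @ (X @ (Wq - W))
    for k in range(K):
        c[k] -= lr * G[assign == k].mean()

Wq = c[assign]
print("weight error     :", np.linalg.norm(W - Wq))
print("feature-map error:", np.linalg.norm(X @ (W - Wq)))

With the assignment fixed, the objective is a convex least-squares problem in the centroid values, which is one reason an alternating scheme such as ADMM is a natural fit for the full method.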
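Finally, the sensitivity-guided stepwise precision reduction can be sketched as a greedy loop with a history log. Here the sensitivity of a candidate step is measured as the change in the network's output feature map caused by dropping one layer's bit-width in isolation; the toy linear network, the average-bit-width stopping rule, and all names are illustrative assumptions rather than the thesis's exact procedure.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((128, 32))
layers = [rng.standard_normal((32, 32)) for _ in range(4)]   # toy network

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def forward(ws):
    h = X
    for w in ws:
        h = np.maximum(h @ w, 0.0)        # ReLU
    return h

def output_with_bits(bits):
    return forward([quantize(w, b) for w, b in zip(layers, bits)])

bits = [8] * len(layers)
ref = output_with_bits(bits)              # reference feature map
target_avg_bits = 4.0                     # stand-in for a compression-ratio target
history = []                              # log of (layer, new_bits) steps

while sum(bits) / len(bits) > target_avg_bits:
    # sensitivity of each candidate step = feature-map change it causes
    best, best_err = None, np.inf
    for i in range(len(layers)):
        if bits[i] <= 2:
            continue
        trial = bits.copy()
        trial[i] -= 1
        err = np.linalg.norm(output_with_bits(trial) - ref)
        if err < best_err:
            best, best_err = i, err
    bits[best] -= 1                       # quantize the least sensitive layer further
    history.append((best, bits[best]))

print("final bit-widths:", bits)
print("history log     :", history)

Because every precision-reduction step is appended to the log, replaying any prefix of the log reproduces the allocation scheme for the corresponding intermediate compression ratio, with no extra search.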
Keywords/Search Tags: Mobile multimedia, Compression, Quantization, Bit-width scheme