
A Study Of Model Compression Approaches To Deep Learning-based Sequence Models

Posted on: 2021-04-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H S Ding    Full Text: PDF
GTID: 1368330602994257    Subject: Information and Communication Engineering
Abstract/Summary:
In the past several years, deep learning based methods have greatly improved the performance of many sequence recognition tasks, including optical character recognition (OCR) and automatic speech recognition (ASR). However, the excellent recognition accuracy of these deep learning models comes with a large number of parameters and a high computational cost, owing to the use of deep and complex network structures. To deploy deep learning based sequence recognition models in products on CPU servers, there is an urgent need to compress and accelerate them as much as possible. This thesis focuses on model compression and acceleration approaches for two state-of-the-art sequence recognition models: (a) integrated convolutional neural network (CNN) and deep bidirectional long short-term memory (DBLSTM) based character models (a.k.a. CNN-DBLSTM) for OCR, and (b) LSTM based acoustic models trained with a connectionist temporal classification (CTC) criterion for ASR.

Firstly, this thesis proposes to compress and accelerate the convolutional layers within CNN-DBLSTM models for OCR with knowledge distillation (KD) and Tucker decomposition. We use KD to transfer the knowledge of a large teacher model to a compact student model, followed by Tucker decomposition to further compress the student model. For the KD method, instead of the conventional cross-entropy (CE) based criterion, we propose, based on the architecture of CNN-DBLSTM models, a distillation objective function that directly matches the feature sequences extracted by the CNNs of the teacher and the student under the guidance of a following LSTM layer. For the Tucker decomposition method, we treat the kernels of a CNN layer as a 4-way tensor and approximate each pre-trained CNN layer with Tucker decomposition. Experimental results on large-scale English handwritten and printed OCR tasks show that the proposed model compression method offers a good solution for building compact CNN-DBLSTM based character models: it significantly reduces the number of model parameters, the memory footprint, and the inference latency without degrading recognition accuracy.
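To make the Tucker step concrete, below is a minimal PyTorch sketch of the kind of Tucker-2 factorization commonly applied to convolutional layers, in which the 4-way kernel tensor is replaced by a 1x1 / kHxkW / 1x1 convolution sequence. The rank choices, the HOSVD-style truncated SVDs used to obtain the factor matrices, and the function name tucker2_decompose_conv are illustrative assumptions; the abstract does not specify the thesis's exact decomposition or rank-selection procedure, and the KD feature-matching objective is not shown here.

# Illustrative sketch (not the thesis's exact procedure): approximate a
# pre-trained Conv2d by a Tucker-2 style factorization of its 4-way kernel
# tensor W of shape (C_out, C_in, kH, kW) into
#   1x1 conv (C_in -> r_in)  ->  kHxkW conv (r_in -> r_out)  ->  1x1 conv (r_out -> C_out).
# Factor matrices come from truncated SVDs of the mode-0 and mode-1 unfoldings;
# the ranks r_in and r_out are assumed hyper-parameters.
import torch
import torch.nn as nn


def tucker2_decompose_conv(conv: nn.Conv2d, r_in: int, r_out: int) -> nn.Sequential:
    W = conv.weight.data                      # (C_out, C_in, kH, kW)
    C_out, C_in, kH, kW = W.shape

    # Mode-0 unfolding (C_out x C_in*kH*kW): leading left-singular vectors.
    U_out = torch.linalg.svd(W.reshape(C_out, -1), full_matrices=False).U[:, :r_out]
    # Mode-1 unfolding (C_in x C_out*kH*kW): leading left-singular vectors.
    U_in = torch.linalg.svd(W.permute(1, 0, 2, 3).reshape(C_in, -1),
                            full_matrices=False).U[:, :r_in]

    # Core tensor: project W onto the two factor subspaces.
    core = torch.einsum('oikl,or,is->rskl', W, U_out, U_in)   # (r_out, r_in, kH, kW)

    first = nn.Conv2d(C_in, r_in, kernel_size=1, bias=False)
    mid = nn.Conv2d(r_in, r_out, kernel_size=(kH, kW),
                    stride=conv.stride, padding=conv.padding, bias=False)
    last = nn.Conv2d(r_out, C_out, kernel_size=1, bias=conv.bias is not None)

    first.weight.data = U_in.t().unsqueeze(-1).unsqueeze(-1)   # (r_in, C_in, 1, 1)
    mid.weight.data = core
    last.weight.data = U_out.unsqueeze(-1).unsqueeze(-1)       # (C_out, r_out, 1, 1)
    if conv.bias is not None:
        last.bias.data = conv.bias.data
    return nn.Sequential(first, mid, last)

In practice the decomposed layers would typically be fine-tuned (for example with the distillation objective described above) to recover any accuracy lost in the low-rank approximation.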
Secondly, this thesis proposes to build compact CNN-DBLSTM based character models using a neural architecture search (NAS) approach. Based on the FairNAS framework, the topologies of the CNN and DBLSTM parts within CNN-DBLSTM are searched jointly. In addition, several search space design methods are investigated to study their influence on the performance of NAS. Experimental results on large-scale English handwriting recognition tasks show that, with an appropriately designed search space, this method can obtain compact CNN-DBLSTM character models that achieve a better trade-off between recognition accuracy and inference latency than manually designed compact CNN-DBLSTM character models.

Lastly, this thesis studies KD techniques for transferring knowledge from CTC-trained DBLSTM-based acoustic models to small-size DBLSTM and deep unidirectional LSTM (DLSTM) students. The conventional CE based criterion makes the implicit assumption that teacher and student share the same frame-wise alignments. However, the alignments learned by teachers can be inaccurate and unstable because CTC training provides no fine-grained alignment guidance. This thesis handles this alignment-inconsistency issue from two perspectives. The first is to design KD criteria with more appropriate alignment assumptions; two such criteria are proposed, namely dynamic frame-wise distillation (DFD) and segment-wise N-best hypotheses imitation (SegNBI). The second is to build more powerful teacher models with more accurate and stable alignments using a novel alignment-consistent ensemble (ACE) technique, in which all models within an ensemble are trained jointly with a regularization term that encourages consistent and stable alignments. Experimental results on large vocabulary continuous speech recognition (LVCSR) tasks show that, when a single DBLSTM model serves as the teacher, DFD and SegNBI achieve mild performance improvements for small-size DBLSTM students compared with the conventional CE based method. With ACE models as teachers, the more accurate alignment learned by ACE removes the need for complex KD criteria such as DFD and SegNBI, and simple CE already achieves satisfactory results. Moreover, a simple target delay (TD) technique is proposed to handle the alignment difference between DBLSTM and DLSTM models. Experimental results show that, with TD, KD can effectively transfer knowledge from DBLSTM ACE teachers to DLSTM students.
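For reference, the sketch below shows the conventional frame-wise CE distillation baseline that DFD, SegNBI, and ACE are designed to improve upon, together with the target-delay idea for unidirectional students. The exact formulations of DFD, SegNBI, ACE, and TD are not given in the abstract, so the temperature, the way the delay shift is applied, and the function name framewise_kd_loss are illustrative assumptions only.

# Illustrative baseline, not the thesis's proposed criteria: conventional
# frame-wise cross-entropy (CE) distillation between a CTC-trained teacher and
# a student, i.e. the criterion whose implicit shared-alignment assumption
# DFD / SegNBI / ACE are designed to relax.  The optional target-delay shift
# for unidirectional (DLSTM) students and the temperature are assumed details;
# padding masks are omitted for brevity.
import torch
import torch.nn.functional as F


def framewise_kd_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0,
                      target_delay: int = 0) -> torch.Tensor:
    """Both logit tensors have shape (batch, time, num_ctc_labels)."""
    if target_delay > 0:
        # Let the student imitate the teacher's output for frame t at frame
        # t + delay, compensating for the future context a bidirectional
        # teacher sees but a unidirectional student does not.
        teacher_logits = teacher_logits[:, :-target_delay]
        student_logits = student_logits[:, target_delay:]

    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)
    # Soft-target cross-entropy, averaged over all frames in the batch.
    return -(t_prob * s_log_prob).sum(dim=-1).mean() * temperature ** 2

The per-frame averaging is exactly where the shared-alignment assumption enters: the teacher's posterior at frame t is taken as the target for the student at the same (possibly shifted) frame, which is what the thesis's DFD, SegNBI, and ACE methods aim to relax or stabilize.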
Keywords/Search Tags:Deep learning, Sequence model, Model compression, Knowledge distillation, Tucker decomposition, Neural architecture search, Dynamic frame-wise distillation, DFD, Segment-wise N-best hypotheses imitation, SegNBI, Alignment-consistent ensemble, ACE