In recent years, the emergence of pre-trained language models (PLMs) has brought natural language processing (NLP) into a new era. Pre-training a language model and then fine-tuning it on downstream tasks has become the dominant paradigm in NLP. PLMs such as BERT, built on the Transformer architecture, have achieved great success on a wide range of NLP tasks. However, these models suffer from large model sizes and high inference latency, which hinder their deployment on resource-constrained edge devices. This thesis focuses on the compression and acceleration of PLMs to facilitate their application in practice.

To improve on student models that perform knowledge distillation (KD) only at the fine-tuning stage, this thesis proposes a novel two-stage KD framework for compressing PLMs, named TinyBERT, which performs KD at both the pre-training and fine-tuning stages. To enable the rich knowledge encoded in a large "teacher" BERT to be effectively transferred to a small "student", TinyBERT further introduces Transformer Distillation, a KD method specifically designed for Transformer-based models. By performing Transformer Distillation at both stages, TinyBERT captures both the general-domain and the task-specific knowledge in BERT. In addition, a new data augmentation procedure is applied at the fine-tuning stage to improve the generalization ability of TinyBERT on specific tasks. Experimental results show that a 4-layer, 312-dimensional TinyBERT4 retains more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller in size and 9.4x faster at inference. Moreover, TinyBERT4 outperforms BERT4-PKD and DistilBERT4 by a margin of at least 4.4% with only 28% of their parameters. An 8-bit quantized TinyBERT4 suffers only a slight performance drop compared with the full-precision model, while being 30x smaller than the BERT-Base teacher.
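To make the layer-wise objective concrete, the following is a minimal PyTorch sketch of a single Transformer Distillation term, under the assumption that the mapped teacher and student layers expose their attention score matrices and hidden states. The learnable projection (`proj`, 312 to 768 dimensions) that bridges the width mismatch, and the equal weighting of the two terms, are illustrative choices; the exact formulation is given in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerDistillLoss(nn.Module):
    """Per-layer distillation term: MSE on attention matrices and on
    (projected) hidden states of a mapped teacher/student layer pair."""

    def __init__(self, student_dim=312, teacher_dim=768):
        super().__init__()
        # learnable projection so the 312-d student can mimic the 768-d teacher
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, attn_s, attn_t, hidden_s, hidden_t):
        # attn_*:   (batch, heads, seq, seq) attention score matrices
        # hidden_*: (batch, seq, dim) hidden states of the mapped layers
        attn_loss = F.mse_loss(attn_s, attn_t)
        hidden_loss = F.mse_loss(self.proj(hidden_s), hidden_t)
        return attn_loss + hidden_loss
```

At the fine-tuning stage, a soft cross-entropy between the student's and teacher's output logits is typically added on top of these layer-wise terms.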
To address the suboptimal performance caused by heuristic layer mapping strategies, this thesis proposes an efficient approach, called Evolved Layer Mapping (ELM), to discover better mapping strategies and thereby improve task-agnostic BERT distillation. Specifically, ELM employs a genetic algorithm (GA) as an evolutionary search engine that iteratively proposes layer mapping strategies and explores better offspring based on their performance. To accelerate the search, ELM further adopts a proxy setting in which a small portion of the full training corpus is sampled for distillation and three representative tasks are chosen for evaluation. Experimental results on the GLUE benchmark show that the optimal layer mapping strategy found by ELM consistently outperforms heuristic ones, and the corresponding student model, with 60% of the parameters, achieves 99.4% of the performance of the teacher BERT-Base while being 2x faster at inference.

To effectively compress multilingual PLMs, which encode richer patterns, this thesis proposes a hybrid approach called LightMBERT that combines two compression techniques: pruning and knowledge distillation. LightMBERT obtains the student model for KD by structurally pruning the Multilingual BERT (mBERT), which gives the student preliminary cross-lingual knowledge from the start. Subsequent KD then lets the student re-acquire the knowledge lost through structured pruning. Zero-shot experimental results on XNLI show that, under the same compression ratio, LightMBERT outperforms the baselines by a margin of at least 2.1% on average and performs on par with its teacher mBERT. Moreover, in another test scenario where annotated data are available in both the source and target languages, LightMBERT demonstrates that a teacher model fine-tuned on the mixed data achieves better cross-lingual ability, and that performing KD with this teacher further improves the performance of the student model on the target language.
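The LightMBERT pipeline, structured pruning to initialise the student followed by KD to recover the pruned knowledge, can be sketched as follows. The helper names, the index sets, and the plain logit-level KD loss are assumptions made for illustration; how heads and hidden dimensions are actually selected from mBERT, and which distillation objectives are used, are specified in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prune_linear(teacher_layer: nn.Linear, keep_out, keep_in):
    """Initialise a narrower student layer by keeping selected rows/columns
    of the teacher's weight matrix (structured pruning)."""
    student = nn.Linear(len(keep_in), len(keep_out))
    with torch.no_grad():
        student.weight.copy_(teacher_layer.weight[keep_out][:, keep_in])
        student.bias.copy_(teacher_layer.bias[keep_out])
    return student

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between student and teacher predictions,
    used to recover the knowledge lost through pruning."""
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)

# Example: initialise a narrower feed-forward layer from a 768-d teacher
teacher_ffn = nn.Linear(768, 3072)
student_ffn = prune_linear(teacher_ffn,
                           keep_out=list(range(1536)),  # keep half the FFN units
                           keep_in=list(range(384)))    # keep half the hidden dims
```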
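Similarly, the evolutionary search behind ELM can be illustrated with a small genetic-algorithm loop over layer mapping strategies. The population size, mutation rate, monotonicity constraint on mappings, and the dummy fitness function `proxy_distill_and_eval` (standing in for distilling on the sampled corpus and evaluating on the three proxy tasks) are all illustrative assumptions rather than the settings used in the thesis.

```python
import random

N_STUDENT, N_TEACHER = 6, 12           # e.g. a 6-layer student, 12-layer teacher
POP_SIZE, GENERATIONS, MUT_RATE = 20, 10, 0.1

def random_mapping():
    # one teacher layer index per student layer, kept non-decreasing
    return sorted(random.randint(1, N_TEACHER) for _ in range(N_STUDENT))

def crossover(a, b):
    cut = random.randint(1, N_STUDENT - 1)
    return sorted(a[:cut] + b[cut:])

def mutate(m):
    m = list(m)
    for i in range(N_STUDENT):
        if random.random() < MUT_RATE:
            m[i] = random.randint(1, N_TEACHER)
    return sorted(m)

def proxy_distill_and_eval(mapping):
    # Dummy stand-in: in ELM this would distil a student on the sampled
    # corpus using `mapping` and return the average score on the three
    # representative proxy tasks. Replace before real use.
    return random.random()

def evolve():
    population = [random_mapping() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=proxy_distill_and_eval, reverse=True)
        parents = scored[: POP_SIZE // 2]   # selection: keep the fittest half
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children
    return max(population, key=proxy_distill_and_eval)
```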