In the current research environment, complex models usually outperform lighter models on the same datasets, so many researchers devote considerable effort to designing ever more complex and elaborate models. However, such models often run into resource constraints when deployed: although abundant computing resources can be used while training a model, the resources available to a program in a real application environment are limited. To make research models usable in practice, various model compression techniques have been developed, which reduce the computational resources a model consumes without sacrificing its accuracy.

As a research hotspot in model compression, knowledge distillation addresses this problem well. It greatly compresses the size of the model while minimizing the performance loss caused by compression, and the whole compression process is completed through a single distillation training run, which is simpler than other methods. Generally speaking, on the same dataset and with current training methods, it is difficult for a small model to match the performance of a large model. Knowledge distillation breaks this limitation: it borrows the idea of a teacher instructing a student, using the large model as a teacher to guide the learning of the small model and thereby push the small model to a new level. The performance gain of the small model after distillation depends mainly on the performance of the teacher model and on the distillation method, so it is necessary to select a teacher model with excellent performance before distillation training.

The main research scenario of this paper is sentiment classification, so the teacher model used for distillation is naturally BERT, which has shown its strength in many NLP tasks. At present, there are two main approaches to distilling the BERT model. The first is to distill BERT into a smaller BERT, in which the student model retains the Transformer structure. The other, which is the choice of this paper, is to distill BERT into a heterogeneous model such as a Bi-LSTM. Although the student's performance is not as good as in the former approach, in environments with extremely limited resources, if the accuracy of the distilled Bi-LSTM meets the application requirements, the Bi-LSTM model with far fewer parameters is clearly more practical.

The main contribution of this paper is that, building on existing BERT distillation schemes and combining them with the ideas of traditional knowledge distillation methods such as Factor Transfer and Similarity-Preserving Knowledge Distillation, we propose two schemes for distilling BERT into a Bi-LSTM: BERT to Bi-LSTM with FT and BERT to Bi-LSTM with SPKD. They are evaluated on the SST-2 and IMDB sentiment classification datasets and compared with the TinyBERT scheme and the distilled Bi-LSTM scheme proposed by Tang. The results show that BERT to Bi-LSTM with SPKD outperforms the other BERT-to-Bi-LSTM distillation schemes on both SST-2 and IMDB, and is only slightly worse than TinyBERT while using far fewer parameters. These results confirm the importance of inter-sample information in the distillation process. Current BERT distillation schemes lack the use of such information; focusing only on the distillation of individual samples limits the performance improvement of the student model, and subsequent research could pay more attention to mining inter-sample information. Finally, the code for this paper has been released on GitHub at https://github.com/bestahao/knowledge_distillnation.
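To make the inter-sample idea concrete, the following is a minimal sketch of the SPKD objective as it is commonly formulated (pairwise similarity matrices over a mini-batch, matched between teacher and student), applied here to BERT-to-Bi-LSTM distillation. The function name and the choice of features (the teacher's [CLS] vectors, the student Bi-LSTM's final hidden states) are illustrative assumptions, not the exact implementation used in this paper.

    import torch
    import torch.nn.functional as F

    def spkd_loss(teacher_feats: torch.Tensor, student_feats: torch.Tensor) -> torch.Tensor:
        """Similarity-preserving distillation loss (sketch).

        teacher_feats: (batch, d_t) sentence representations from the teacher,
                       e.g. BERT [CLS] vectors; d_t may differ from d_s.
        student_feats: (batch, d_s) sentence representations from the student,
                       e.g. the Bi-LSTM's final hidden state.
        """
        b = teacher_feats.size(0)
        # Pairwise similarity (Gram) matrices over the mini-batch.
        g_t = teacher_feats @ teacher_feats.t()   # (b, b)
        g_s = student_feats @ student_feats.t()   # (b, b)
        # Row-wise L2 normalisation of each similarity matrix.
        g_t = F.normalize(g_t, p=2, dim=1)
        g_s = F.normalize(g_s, p=2, dim=1)
        # Mean squared Frobenius distance between the two matrices.
        return torch.norm(g_s - g_t, p="fro") ** 2 / (b * b)

    # Usage example with random features standing in for real encoder outputs.
    t = torch.randn(16, 768)   # hypothetical BERT [CLS] vectors
    s = torch.randn(16, 300)   # hypothetical Bi-LSTM hidden states
    loss = spkd_loss(t, s)

In training, a loss of this form would typically be added to the usual soft-label and hard-label terms, so that the student is encouraged to preserve the teacher's similarity structure across samples rather than matching each sample's output in isolation.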