
Research On Efficient Knowledge Distillation Methods

Posted on: 2022-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Pei
Full Text: PDF
GTID: 2568306326476794
Subject: Computer Science and Technology

Abstract/Summary:
Knowledge Distillation (KD) is a simple but effective method for model compression: the knowledge of a well-trained teacher network (a large neural network) is used to assist in training a student network (a small neural network), thereby effectively improving the student network's performance. Although knowledge distillation has made significant progress, several difficulties remain: (1) when there is a large capacity gap between the teacher network (Teacher) and the student network (Student), the Student cannot effectively learn the Teacher's knowledge, which leads to poor classification performance; (2) although a great deal of work has studied different types of the Teacher's knowledge, which type is most conducive to the Student's learning is still an open question; (3) misinformation in the Teacher may harm the Student's learning; (4) online distillation can cause serious homogeneity among Students, which hinders further improvement of their performance. To address these issues, this thesis focuses on efficient knowledge distillation methods for image classification. The main contributions are as follows:

(1) Self-boosting Feature Distillation is proposed. This method addresses the Teacher-Student gap from a new perspective: the Student's own information is used to improve the Student's learning ability, thereby alleviating the gap between Teacher and Student. Integrated features are constructed to imitate the Teacher's original features, and a new self-distillation strategy is proposed that uses only the Student's parameters from the previous epoch to update its current parameters, without increasing memory usage or forward propagation (a hedged sketch of such an update is given below). Moreover, the effectiveness of self-boosting feature distillation is explained through Richardson extrapolation, which shows that it improves the Student's order of convergence. Extensive experiments show that the proposed method delivers excellent distillation performance, significantly better than current state-of-the-art knowledge distillation methods.

(2) An online distillation method based on contrastive learning is proposed. To counter the serious homogeneity among Students in online distillation, the similarity between samples is treated as the knowledge the Students learn from one another, which mitigates homogeneity; a new loss function is also designed to appropriately increase the diversity among Students (an illustrative similarity loss is sketched below). In addition, the ensemble of multiple Students' outputs is used as the Teacher and the remaining network as the leader, and an additional self-distillation loss is applied to the leader network to alleviate the Teacher-Student gap. Experimental results show that the method effectively improves the performance of both the leader and the ensembled Teacher.

(3) An offline distillation method based on parameter-free loss estimation is proposed. Because the classical distillation loss is difficult to optimize and its temperature hyperparameter is hard to choose (the classical loss is sketched below for reference), four new parameter-free losses based on information normalization are designed, which greatly improve the distillation effect. In addition, since knowledge distillation is essentially a function-fitting problem over discrete data points, an intra-class neighborhood sampling strategy is proposed to increase the density and richness of the data, enabling the Student to capture richer knowledge from the Teacher. Experiments on multiple datasets show that the proposed method greatly improves the Student's performance.
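The abstract does not give the exact self-distillation update used in contribution (1); the sketch below is a hypothetical illustration only. It blends the Student's current parameters with a snapshot taken at the end of the previous epoch, in the spirit of the Richardson-extrapolation interpretation; the function name `self_boost_step` and the coefficient `beta` are assumptions, not the thesis's formulation.

```python
# Hypothetical sketch of contribution (1): reuse only the Student's previous-epoch
# parameters to update its current parameters. The extrapolation form and the
# coefficient `beta` are illustrative assumptions, not the thesis's exact rule.
import torch

@torch.no_grad()
def self_boost_step(student, prev_epoch_params, beta=0.5):
    # Push current parameters away from the previous-epoch snapshot,
    # then return a fresh snapshot for the next epoch. No extra forward
    # pass and no second copy of the network are needed.
    for name, p in student.named_parameters():
        p.add_(beta * (p - prev_epoch_params[name]))
    return {n: p.detach().clone() for n, p in student.named_parameters()}
```

In use, the returned snapshot would be fed back in at the end of each epoch, e.g. `prev = self_boost_step(student, prev)` after initializing `prev` from the Student's starting parameters.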
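Contribution (2) treats the similarity between samples as the knowledge that peer Students exchange. The abstract does not specify the similarity measure or the matching loss, so the sketch below uses a batch-wise cosine-similarity matrix matched with a mean-squared error purely as an illustration; the function names are assumptions.

```python
# Illustrative sketch of contribution (2): pairwise sample similarity within a
# batch serves as the knowledge two peer Students exchange. The cosine-similarity
# matrix and the MSE matching loss are assumptions made for illustration.
import torch
import torch.nn.functional as F

def batch_similarity(features):
    # Row-normalize the feature matrix, then take pairwise cosine similarities.
    f = F.normalize(features, dim=1)
    return f @ f.t()

def peer_similarity_loss(student_a_feats, student_b_feats):
    # Penalize disagreement between the two Students' relational views of the
    # batch; the peer's similarities are detached and treated as a fixed target.
    return F.mse_loss(batch_similarity(student_a_feats),
                      batch_similarity(student_b_feats.detach()))
```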
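For reference on contribution (3), the classical soft-target distillation loss whose temperature hyperparameter the proposed parameter-free losses aim to eliminate can be written as follows; the default values of `T` and `alpha` are illustrative, and the thesis's own information-normalization losses are not reproduced here.

```python
# The classical distillation loss referenced in contribution (3): a temperature-
# scaled KL term on soft targets plus cross-entropy on hard labels. T and alpha
# are exactly the hyperparameters the parameter-free losses avoid tuning.
import torch
import torch.nn.functional as F

def classical_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```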
Keywords/Search Tags:Self-boosting, Feature Distillation, Online Distillation, Offline Distillation