In recent years, deep neural networks have achieved state-of-the-art results on many computer vision tasks. Such performance, however, typically relies on complex network architectures whose parameter counts reach millions or even billions, and training these large networks demands substantial computing power and time. The deployment cost of such overparameterized models limits their use on mobile devices, and large networks offer no advantage in tasks with strict real-time requirements. Knowledge distillation is an effective technique for model compression and knowledge transfer. Built on the teacher-student framework, it transfers the knowledge of a large teacher network to a relatively small student network by training the student to match the teacher's outputs or informative features, yielding a lightweight network with competitive accuracy.

However, the traditional teacher-student distillation framework suffers from a capacity-mismatch dilemma: as the teacher network grows, the student's performance first improves and then degrades. This indicates that a large capacity gap between teacher and student harms the effect of knowledge distillation, and that a teacher with larger capacity and higher accuracy does not necessarily produce a better student.

To address this dilemma, this research proposes MHT-KD, a knowledge distillation method based on multiple homogeneous teacher networks. The method replaces a single large teacher with a group of small teacher models sharing the student's architecture, and the student learns from the knowledge jointly provided by the teacher group, thereby alleviating the negative effect of the teacher-student capacity gap. The method consists of two stages. In the first stage, each member of the teacher group is independently initialized and pre-trained, so that every member can contribute diverse knowledge. In the second stage, all members of the teacher group transfer knowledge to the student simultaneously and independently, with a one-to-one teaching relationship between each teacher member and the student. Because each teacher member has the same architecture as the student, the large capacity gap between teacher and student is alleviated and knowledge can be transferred effectively between them.

Building on the multiple-homogeneous-teacher framework, this research further proposes a confidence-adaptive initialization strategy for the student. Depending on the teacher group's confidence during training, different initialization methods are applied adaptively: when the teacher group is overconfident, the student keeps its normal initialization; otherwise, one member is selected from the teacher group and all of its parameters are used to initialize the student network. To further improve the student, this research also designs a feature similarity loss defined on the network's classification layer, which constrains the similarities among the classification-layer parameters and thereby indirectly shapes the quality of the features extracted by the network. The loss applies to basic image classification tasks and is also compatible with the multiple homogeneous teacher distillation framework.
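The distillation stage described above can be summarized in code. The following is a minimal sketch of one optimization step in which the student matches every homogeneous teacher; the temperature T, the weight alpha, and the equal averaging over teacher members are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of stage 2 of MHT-KD: one-to-one distillation from each homogeneous
# teacher to the student. Hyperparameters T and alpha are assumptions.
import torch
import torch.nn.functional as F

def mht_kd_step(student, teachers, x, y, optimizer, T=4.0, alpha=0.9):
    """One optimization step: the student matches every teacher member."""
    optimizer.zero_grad()
    s_logits = student(x)
    # Hard-label cross-entropy on the ground truth.
    loss = (1.0 - alpha) * F.cross_entropy(s_logits, y)
    # One-to-one soft-label distillation from each teacher member.
    for teacher in teachers:
        with torch.no_grad():
            t_logits = teacher(x)
        kd = F.kl_div(
            F.log_softmax(s_logits / T, dim=1),
            F.softmax(t_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        loss = loss + alpha * kd / len(teachers)
    loss.backward()
    optimizer.step()
    return loss.item()
```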
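The confidence-adaptive initialization strategy could be realized as below. The confidence measure (mean maximum softmax probability over training data) and the threshold tau are assumptions for illustration; the paper's exact criterion and the rule for selecting a teacher member may differ.

```python
# Hedged sketch of the confidence-adaptive initialization strategy.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_confidence(model, loader):
    """Average top softmax probability, a common confidence proxy (assumed)."""
    model.eval()
    total, count = 0.0, 0
    for x, _ in loader:
        probs = F.softmax(model(x), dim=1)
        total += probs.max(dim=1).values.sum().item()
        count += x.size(0)
    return total / count

def init_student(student, teachers, loader, tau=0.95):
    """Copy a teacher's weights only when the group is not overconfident."""
    confs = [mean_confidence(t, loader) for t in teachers]
    if sum(confs) / len(confs) >= tau:
        return student  # overconfident group: keep the normal initialization
    best = teachers[confs.index(max(confs))]  # selection rule is an assumption
    # Identical topology makes a full parameter copy possible.
    student.load_state_dict(copy.deepcopy(best.state_dict()))
    return student
```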
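One plausible reading of the classification-layer feature similarity loss is to penalize pairwise cosine similarity between the classifier's class weight vectors, which indirectly pushes the backbone toward more separable features. The formulation below is a sketch under that assumption, not the paper's exact loss.

```python
# Assumed form of the classification-layer feature similarity loss.
import torch
import torch.nn.functional as F

def classifier_similarity_loss(fc_weight):
    """fc_weight: (num_classes, feat_dim) weight of the final linear layer."""
    w = F.normalize(fc_weight, dim=1)            # unit-norm class prototypes
    sim = w @ w.t()                              # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim)) # drop self-similarities
    n = fc_weight.size(0)
    # Mean squared off-diagonal similarity; 0 when class weights are orthogonal.
    return (off_diag ** 2).sum() / (n * (n - 1))

# Illustrative usage: total = task_loss + beta * classifier_similarity_loss(model.fc.weight)
```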
MHT-KD thus comprises three components: the homogeneous teacher group, the adaptive initialization strategy based on the teacher group's confidence, and the feature similarity loss based on the network classification layer. This research conducts extensive experiments on different image classification datasets and network models. Compared with classical knowledge distillation, MHT-KD improves student accuracy significantly, and it shows consistent accuracy advantages over other state-of-the-art knowledge distillation methods. The proposed method effectively alleviates the negative effects caused by a large capacity gap between teacher and student and improves student accuracy, offering guidance for the study of efficient knowledge transfer under the teacher-student framework.