With the improvement of electronic device performance and the rapid development of deep learning algorithms, speaker recognition systems based on deep learning have made significant progress, surpassing earlier statistical-model-based methods in both recognition performance and robustness, and have become the focus of research in the field of speaker recognition. However, in many complex real-world scenarios, existing speaker recognition models still fall short of the requirements for practical deployment. To further improve recognition accuracy and robustness, this paper investigates the key techniques in deep-learning-based speaker recognition models and proposes a more effective frame-level feature extraction network, together with a loss function and training strategy that enable more adequate model training, to address the shortcomings of existing models.

On the one hand, to overcome the limitation that existing deep-learning-based speaker recognition models cannot fully exploit the global importance distribution of the different frequency features in the input spectral features, this paper proposes the Frequency Reweight Layer (FRL). The FRL automatically learns an importance weight for each frequency dimension from the training data and uses these weights to enhance the input spectral features in the frequency domain. To further apply the FRL to the intermediate feature maps of the network, a Frequency Reweight Network (FRN) built from multiple FRL structures is also proposed, and the optimal FRL layer configuration is determined experimentally. The experiments show that the proposed FRL improves the recognition performance of the model while adding a negligible number of new parameters. In addition, by visualizing and analyzing the weight coefficients learned by the FRL and conducting validation experiments, this paper concludes that the low-frequency features in the input spectral features are more important to speaker recognition models than the high-frequency features.

On the other hand, to address the problems that existing margin-based Softmax losses lack the ability to mine hard samples and that their optimization objective is inconsistent with the actual evaluation metrics of the speaker recognition task, this paper proposes a fusion loss function that combines the Mis-Classified Vector Guided Softmax loss (MV-Softmax) with the Angular Prototypical loss under a few-shot learning framework. Furthermore, because the MV-Softmax loss is difficult to converge in the early stage of speaker recognition model training, a two-stage training strategy is proposed. The experimental results show that the proposed loss function and training strategy further improve the recognition performance and robustness of the baseline model.

Finally, the proposed methods are integrated into the baseline model and compared with published state-of-the-art models on the VoxCeleb1 test set. When trained only on the original VoxCeleb1 development set, the proposed model achieves an EER of 2.39% and a MinDCF of 0.25, which are 9.81% and 16.94% lower, respectively, than those of the baseline model, outperforming existing state-of-the-art models. When trained only on the original VoxCeleb2 development set, the proposed model achieves an EER of 1.35% and a MinDCF of 0.138, reducing EER by 12.33% and MinDCF by 2.13% relative to the baseline model and achieving performance comparable to the current state-of-the-art model. Moreover, the proposed model has fewer parameters than existing state-of-the-art models and therefore higher application potential.
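The core operation of the FRL described above, multiplying each frequency bin of the input spectral features by a learned importance weight, can be illustrated with a minimal numpy sketch. The function name, the use of sigmoid-squashed weights, and the spectrogram shape are illustrative assumptions; in the actual model the weights would be trainable parameters optimized by backpropagation.

```python
import numpy as np

def frequency_reweight(spec, weights):
    """Reweight each frequency bin of a spectrogram by an importance weight.

    spec:    (n_freq, n_frames) log-mel spectrogram (hypothetical shape)
    weights: (n_freq,) per-frequency importance weights
    """
    # Broadcast the per-frequency weights across all time frames
    return weights[:, None] * spec

rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))                      # 40 mel bins, 100 frames
# Sigmoid-squashed weights in (0, 1); an assumption, not the paper's exact gating
weights = 1.0 / (1.0 + np.exp(-rng.standard_normal(40)))

out = frequency_reweight(spec, weights)
print(out.shape)  # (40, 100)
```

Because the reweighting is a single elementwise product per frequency bin, it adds only `n_freq` parameters per layer, consistent with the negligible parameter overhead reported above.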
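For the fusion loss, the Angular Prototypical component under the few-shot framing can be sketched as follows: each speaker's support embeddings are averaged into a prototype, and each query is classified against all prototypes via scaled cosine similarity with a cross-entropy objective. The scale and bias values, batch shapes, and embedding dimension are illustrative assumptions, and the full method additionally fuses this term with the MV-Softmax loss, which is omitted here for brevity.

```python
import numpy as np

def angular_prototypical_loss(queries, supports, w=10.0, b=-5.0):
    """Angular Prototypical loss for a batch of N distinct speakers.

    queries:  (N, D) one query embedding per speaker
    supports: (N, M, D) M support embeddings per speaker
    w, b:     learnable scale and bias of the cosine similarity (fixed here)
    """
    # Speaker prototypes: mean of each speaker's support embeddings
    protos = supports.mean(axis=1)                              # (N, D)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    # Scaled cosine similarity between every query and every prototype
    logits = w * (q @ p.T) + b                                  # (N, N)
    # Numerically stable cross-entropy; the matching prototype is the target
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
loss = angular_prototypical_loss(rng.standard_normal((8, 192)),
                                 rng.standard_normal((8, 2, 192)))
print(float(loss))
```

Because the objective directly contrasts a query against same-speaker and different-speaker prototypes by cosine similarity, it aligns more closely with the verification-style evaluation metrics (EER, MinDCF) than a plain classification Softmax does.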