Research On Speaker Recognition Technology Based On Deep Learning

Posted on: 2022-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Shao
GTID: 2518306575464864
Subject: Control Science and Engineering
Abstract/Summary:
In recent years, with the spread of smart terminals, traditional identity authentication methods have become unable to meet people's needs. With the development of pattern recognition and related technologies, various biometric technologies have emerged. Speaker recognition is one of them. It offers convenient collection, high user acceptance, and strong universality, and it is widely applied in the military, public security, and business fields. By task, speaker recognition can be divided into speaker verification and speaker identification. This thesis studies text-independent speaker verification. The main work is as follows:

1. This thesis introduces SpecAugment, a data augmentation method commonly used in speech recognition. It is fast to compute and convenient for online augmentation. Experimental results on VoxCeleb1 show that SpecAugment remains effective for speaker recognition tasks. This thesis also compares SpecAugment with the x-vector data augmentation method. Under three losses (cross-entropy, AM-Softmax, and MSE), the experimental results show that the more complex x-vector data augmentation is more effective for speaker recognition tasks.

2. Starting from the ResNet34 model, this thesis changes the convolution kernel size to 3x3 and adjusts the residual block configuration to [3,3,3] to obtain a less complex ResNet model. Under the cross-entropy loss, the ResNet model achieves an equal error rate of 5.1% on VoxCeleb1, which is better than the i-vector and x-vector baseline models. The network structure experiments also show that, when the number of parameters is held constant, the architecture performs better.

3. This thesis proposes a training method based on knowledge distillation, which uses an MSE loss to constrain the difference between the ResNet speaker embeddings and the i-vectors. This method can also be regarded as a form of unsupervised training for neural networks. With the MSE loss, the ResNet model achieves an equal error rate of 4.7% on VoxCeleb1, which is better than its teacher model, the i-vector, indicating that deep neural networks have better generalization capabilities. In addition, this thesis proposes a method based on joint training, which is more efficient than combining models through model integration. With a joint loss that combines the AM-Softmax loss and the MSE loss, the equal error rate is further reduced to 3.229%, which is better than most current models. The experimental results also show that the AM-Softmax loss helps improve model performance under cosine scoring, and that the MSE loss helps improve model performance under PLDA scoring.
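
The joint objective described in item 3 can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the scale s, the margin m, the weighting factor alpha, the simple weighted sum, and the assumption that the network embedding has the same dimension as the i-vector are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def am_softmax_loss(embeddings, weights, labels, s=30.0, m=0.2):
    """Additive-margin softmax loss, averaged over the batch.

    embeddings: (batch, dim) speaker embeddings from the network
    weights:    (dim, n_speakers) classification-layer weights
    labels:     (batch,) integer speaker labels
    s, m:       scale and margin (assumed values, not from the thesis)
    """
    # L2-normalise embeddings and class weights so the logits are cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w
    rows = np.arange(len(labels))
    logits = s * cos
    # Subtract the margin from the target-class cosine only.
    logits[rows, labels] = s * (cos[rows, labels] - m)
    # Log-softmax with the usual max-shift for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

def mse_distillation_loss(embeddings, ivectors):
    """Mean squared error between network embeddings and teacher i-vectors."""
    return np.mean((embeddings - ivectors) ** 2)

def joint_loss(embeddings, weights, labels, ivectors, alpha=1.0):
    """Weighted sum of the two losses; alpha is a hypothetical weighting factor."""
    return (am_softmax_loss(embeddings, weights, labels)
            + alpha * mse_distillation_loss(embeddings, ivectors))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))      # toy batch: 4 utterances, 8-dim embeddings
    W = rng.normal(size=(8, 3))        # toy classifier over 3 speakers
    y = np.array([0, 1, 2, 1])
    ivec = rng.normal(size=(4, 8))     # stand-in for precomputed i-vectors
    print("joint loss:", joint_loss(emb, W, y, ivec))
```

In this sketch the MSE term needs no speaker labels, only precomputed i-vectors, which is consistent with the abstract's remark that the distillation step can be viewed as unsupervised training.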
Keywords/Search Tags: speaker recognition, ResNet, i-vector, data augmentation, knowledge distillation, joint training