Research On Speaker Recognition Technology Based On Deep Learning

Posted on:2022-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:Z X Shao

Full Text:PDF

GTID:2518306575464864

Subject:Control Science and Engineering

Abstract/Summary:

In recent years, with the spread of smart terminals, traditional identity authentication methods have become unable to meet people's needs. With the development of pattern recognition and related technologies, various biometric technologies have emerged. Speaker recognition, one of them, offers convenient collection, high user acceptance, and strong universality, and is widely applied in the military, public security, and business fields. By task, speaker recognition can be divided into speaker verification and speaker identification. This thesis studies text-independent speaker verification. The main work is as follows:

1. This thesis introduces SpecAugment, a data augmentation method commonly used in speech recognition. It is fast to compute and convenient for online augmentation. Experimental results on VoxCeleb1 show that SpecAugment remains effective for speaker recognition tasks. This thesis also compares SpecAugment with the x-vector data augmentation method. Under three losses (cross-entropy, AM-Softmax, and MSE), the experimental results show that the more complex x-vector data augmentation is more effective for speaker recognition tasks.

2. Based on the ResNet34 model, this thesis changes the convolution kernel size to 3x3 and adjusts the residual block configuration to [3,3,3], obtaining a ResNet model of lower complexity. Under the cross-entropy loss, the ResNet model achieves an equal error rate of 5.1% on VoxCeleb1, which is better than the i-vector and x-vector baseline models. The network structure experiments also show that, with the number of parameters held constant, the architecture performs better.

3. This thesis proposes a training method based on knowledge distillation, which uses an MSE loss to constrain the difference between the ResNet speaker features and the i-vectors. This method can also be regarded as a form of unsupervised training of neural networks. With the MSE loss, the ResNet model achieves an equal error rate of 4.7% on VoxCeleb1, which is better than its teacher model, the i-vector, indicating that deep neural networks generalize better. In addition, this thesis proposes a joint training method, which is more efficient than combining models through model integration. With a joint loss that combines the AM-Softmax loss and the MSE loss, the equal error rate is further reduced to 3.229%, which is better than most current models. The experimental results also show that the AM-Softmax loss helps improve model performance under cosine scoring, and the MSE loss helps improve it under PLDA scoring.
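
The joint loss mentioned in the abstract can be illustrated with a minimal per-sample sketch in plain Python. The abstract does not say how the two terms are combined, so the weighted sum below is an assumption, as are the function names and the values of `margin`, `scale`, and `weight`; none of them are taken from the thesis.

```python
import math


def am_softmax_loss(cosines, label, margin=0.2, scale=30.0):
    """AM-Softmax loss for one sample.

    cosines: cosine similarity between the embedding and each speaker class.
    label:   index of the true speaker.
    margin and scale are illustrative defaults, not values from the thesis.
    """
    # Subtract the additive margin from the target-class cosine only.
    logits = [
        scale * (c - margin) if i == label else scale * c
        for i, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def mse_loss(embedding, ivector):
    """Mean squared error between a network embedding and its teacher i-vector."""
    return sum((e - t) ** 2 for e, t in zip(embedding, ivector)) / len(embedding)


def joint_loss(cosines, label, embedding, ivector, weight=1.0):
    """Assumed combination: AM-Softmax term plus a weighted distillation term."""
    return am_softmax_loss(cosines, label) + weight * mse_loss(embedding, ivector)


if __name__ == "__main__":
    # Toy values for illustration only.
    print(joint_loss([0.7, 0.1, -0.2], 0, [0.5, -0.1], [0.4, 0.0]))
```

In this sketch the MSE term plays the distillation role (pulling the embedding toward the i-vector), while the AM-Softmax term supplies the supervised speaker-classification signal.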

Keywords/Search Tags:

speaker recognition, ResNet, i-vector, data augmentation, knowledge distillation, joint training

Related items

1	Research On Privacy Protection Of Training Data Based On Knowledge Distillation
2	Speaker Adaptation Of DNN-HMM Acoustic Model For Speech Recognition
3	System Design And Robust Optimization Of Speaker Recognition Based On ASV-Subtools
4	Studies On Speaker Recognition Based On SVM And GMM
5	Research On Financial Text Generation Method Based On Knowledge Distillation And Pre-training Model
6	Research On Application Of Data Augmentation Based On Different Speech Habits In Speech Recognition In Telephone Scene
7	Research On Speaker Recognition Over Short Utterance And Varying Channels
8	Scene Text Recognition Based On Attention Mechanism And Knowledge Distillation
9	Research On Data Augmentation Method For Intention Identification
10	A Data Augmentation Approach For Annotating Web Table Columns By Knowledge Base Classes