End-to-End Speaker Embedding For Speaker Recognition In The Wild

Posted on: 2022-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Yu
GTID: 2518306725993269
Subject: Computer Science and Technology
Abstract/Summary:
Speaker embedding, the core technology of speaker recognition, extracts a vector representing the speaker's identity from a speech utterance. In the past few years, researchers have increasingly focused on deep learning-based end-to-end speaker embedding models rather than statistical models. Meanwhile, improving the accuracy of speaker recognition in the wild remains a major challenge, since speech environments in the wild are complicated, noisy and variable. This thesis analyses recent work on end-to-end speaker embedding from the perspectives of network and loss function. For speaker recognition in the wild, it proposes three methods for end-to-end speaker embedding, addressing the backbone, the denoising strategy and the loss function.

1. Speech environments are complicated in the wild. Previous speaker embedding models either incur high computational cost or have weak fitting ability. This thesis proposes end-to-end speaker embedding backbones based on the densely connected time delay neural network (D-TDNN). The D-TDNN model uses time delay neural network layers to process speech data efficiently, and it adopts a pyramid structure, bottleneck layers and identity mapping to reduce its parameter count and strengthen its fitting ability in the wild. Experiments show that, compared with previous end-to-end speaker embedding models, the D-TDNN model improves the accuracy of speaker recognition in the wild with fewer parameters.

2. Speech environments are noisy in the wild. Previous methods usually rely either on speech enhancement models, which have high computational cost, or on denoising backends, which lack sufficient information. For speaker recognition with end-to-end speaker embedding, this thesis proposes a denoising strategy based on context-aware masking (CAM). CAM masks feature maps under the guidance of global context to suppress noise, and it shares the backbone with the speaker embedding model to avoid redundant computation. Experiments show that CAM improves the accuracy of speaker recognition in the wild at low computational cost by enabling the end-to-end speaker embedding model itself to denoise.

3. Speech environments are variable in the wild. Previous end-to-end speaker embedding models optimized with Softmax loss are not discriminative enough. This thesis introduces large margin cosine loss (LMCL) for optimizing D-TDNN models and thoroughly analyses the mechanism of LMCL. LMCL improves the robustness and efficiency of gradient descent by adopting cosine similarity as the metric, and it exploits examples of medium difficulty by introducing a margin. Experiments show that D-TDNN models optimized with LMCL instead of Softmax loss are more discriminative and improve the accuracy of speaker recognition in the wild.
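To make the role of the cosine metric and the margin in LMCL concrete, the following is a minimal sketch of the loss for a single example in its commonly published form, L = -log( exp(s(cos θ_y - m)) / (exp(s(cos θ_y - m)) + Σ_{j≠y} exp(s cos θ_j)) ). It is not taken from the thesis: the function names, the scale `s = 30.0` and the margin `m = 0.35` are illustrative assumptions, and plain Python lists stand in for the tensors a real D-TDNN training pipeline would use.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def lmcl_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Large margin cosine loss for one example.

    embedding:     speaker embedding of the utterance
    class_weights: one weight vector per training speaker
    target:        index of the true speaker
    scale, margin: hypothetical hyperparameter values, chosen for illustration
    """
    cosines = [cosine(embedding, w) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then rescale.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Example: an embedding equally close to two speakers' weight vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_without_margin = lmcl_loss([1.0, 1.0], weights, target=0, margin=0.0)
loss_with_margin = lmcl_loss([1.0, 1.0], weights, target=0)
```

With the margin set to zero the function reduces to Softmax loss on scaled cosines; a positive margin raises the loss for the same example, so the model is pushed to separate speakers by a wider angular gap. Because both the embedding and the weights are normalized, the loss depends only on direction, not on vector length.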
Keywords/Search Tags: speaker recognition, speaker embedding, time delay neural network, masking, large margin cosine loss