End-to-End Speaker Embedding For Speaker Recognition In The Wild

Posted on:2022-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Yu

Full Text:PDF

GTID:2518306725993269

Subject:Computer Science and Technology

Abstract/Summary:

Speaker embedding, the core technology of speaker recognition, extracts a vector representing the speaker's identity from a speech utterance. In the past few years, researchers have increasingly focused on deep learning-based end-to-end speaker embedding models rather than statistical models. Meanwhile, improving the accuracy of speaker recognition in the wild remains a major challenge, because speech environments in the wild are complicated, noisy, and variable. This thesis analyses recent work on end-to-end speaker embedding from the perspectives of the network and the loss function. For speaker recognition in the wild, it proposes three methods for end-to-end speaker embedding, addressing the backbone, the denoising strategy, and the loss function.

1. Speech environments are complicated in the wild. Previous speaker embedding models either incur high computational cost or have weak fitting ability. This thesis proposes end-to-end speaker embedding backbones based on the densely connected time delay neural network (D-TDNN). The D-TDNN model uses a time delay neural network to process speech data efficiently, and it adopts a pyramid structure, bottleneck layers, and identity mapping to reduce its parameter count and strengthen its fitting ability in the wild. Experiments show that, compared with previous end-to-end speaker embedding models, the D-TDNN model improves the accuracy of speaker recognition in the wild with fewer parameters.

2. Speech environments are noisy in the wild. Previous methods usually rely either on speech enhancement models, which have high computational cost, or on denoising back ends, which have too little information to work with. For speaker recognition with end-to-end speaker embedding, this thesis proposes a denoising strategy based on context-aware masking (CAM). CAM masks feature maps under the guidance of global context to suppress noise, and it shares the backbone with the speaker embedding model to avoid redundant computation. Experiments show that CAM improves the accuracy of speaker recognition in the wild at low computational cost by enabling the end-to-end speaker embedding model itself to denoise.

3. Speech environments are variable in the wild. Previous end-to-end speaker embedding models optimized with Softmax loss are not discriminative enough. This thesis introduces large margin cosine loss (LMCL) for optimizing D-TDNN models and thoroughly analyses the mechanism of LMCL. LMCL improves the robustness and efficiency of gradient descent by adopting cosine similarity as the metric, and it exploits examples of medium difficulty by introducing a margin. Experiments show that D-TDNN models optimized with LMCL instead of Softmax loss are more discriminative and improve the accuracy of speaker recognition in the wild.
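The abstract does not give the LMCL formulation used in the thesis, so the following is only a minimal sketch of the commonly published form of large margin cosine loss: the embedding and each class weight vector are L2-normalized, the margin is subtracted from the target-class cosine, and the result is scaled before a standard cross-entropy. The function names, the toy weight vectors, and the `scale` and `margin` values are illustrative assumptions, not values taken from the thesis.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(embedding, class_weights):
    """Cosine similarity between the embedding and every class weight vector."""
    unit_embedding = l2_normalize(embedding)
    return [
        sum(a * b for a, b in zip(unit_embedding, l2_normalize(weight)))
        for weight in class_weights
    ]


def lmcl_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Large margin cosine loss for one example.

    scale and margin are hypothetical defaults chosen for illustration.
    """
    cosines = cosine_logits(embedding, class_weights)
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [
        scale * (cosine - margin) if index == label else scale * cosine
        for index, cosine in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp for the cross-entropy denominator.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Toy example: three hypothetical speaker classes in a 2-D embedding space.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embedding = [2.0, 1.0]

loss_with_margin = lmcl_loss(embedding, weights, label=0, margin=0.2)
loss_without_margin = lmcl_loss(embedding, weights, label=0, margin=0.0)
```

With `margin=0.0` the function reduces to a scaled cosine Softmax loss. A positive margin lowers the target logit, so the same example yields a larger loss and keeps contributing gradient until its cosine to the correct class exceeds the others by at least the margin. Because both vectors are normalized, rescaling the embedding leaves the loss unchanged.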

Keywords/Search Tags:

Speaker recognition, speaker embedding, time delay neural network, masking, large margin cosine loss

Related items

1	Speaker Recognition Based On Additive Margin Loss
2	Research On Speaker Recognition Method Based On Deep Learning
3	Research On Loss Functions In Neural Networks For Speaker Recognition
4	Triplet Loss And Manifold Dimensionality Reduction Based Method For Text-independent Speaker Recognition
5	Research On Methods Of Improving The Representation Ability Of Speaker Recognition Models
6	Research On Speaker Representation Based On MG Training Criteria
7	Research On Speaker Recognition Clustering Algorithm
8	Research On Speaker Recognition Algorithm Based On Deep Convolutional Neural Network
9	Study On Rhesus Macaque Voice Print Based On End-to-end Model
10	Research On End-to-end Speaker Recognition Based On Raw Waveform