
Key Technologies Research On Speaker Recognition In The Wild

Posted on: 2022-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: C F Luo
Full Text: PDF
GTID: 2518306569466104
Subject: Control Engineering
Abstract/Summary:
With the development of deep learning, speaker recognition research has undergone a paradigm shift from traditional statistical methods to deep learning-based methods, which have surpassed traditional methods in recognition accuracy and robustness. However, speaker recognition in the wild is affected by complex acoustic conditions such as background noise, reverberation, overlapping speech from multiple speakers, and inconsistent utterance duration. These conditions prevent the model from effectively extracting identity-discriminative features from the acoustic features and degrade recognition performance. To improve the accuracy and robustness of speaker recognition in the wild, this thesis investigates key techniques in deep learning-based speaker recognition and proposes a new multi-scale frame-level feature aggregation strategy and a combined loss function, referred to as the assembling loss.

First, a multi-scale feature aggregation strategy built on NeXtVLAD, named NeXtVLAD-MSA, is proposed to enhance the ability of the frame-level feature extractor to extract highly discriminative features from unconstrained speech. NeXtVLAD, which originates in the video analysis domain, is introduced into the frame-level feature aggregation layer of the speaker recognition model. There it exploits the hierarchical time-frequency contextual information in the hidden layers of the deep convolutional neural network (DCNN) and aggregates it into utterance-level feature vectors. These utterance-level feature vectors are subsequently fused to generate speaker embeddings with speaker-discriminative capability. Experiments show that NeXtVLAD-MSA outperforms existing frame-level feature aggregation methods on the speaker recognition task.

Second, to improve both the discriminative ability of the speaker recognition model and its training efficiency, this thesis integrates the cosine-prototypical loss (CP-Loss), which is based on the few-shot learning framework, with the margin-based Softmax loss into an assembling loss with complementary advantages, and trains the model in the few-shot learning framework. The margin-based Softmax loss effectively increases the inter-class distance and stabilizes training, while CP-Loss directly optimizes the speaker embedding space. The model trained with the proposed assembling loss achieves better performance and robustness in speaker recognition in the wild. To verify the effectiveness and generalization of the assembling loss, extensive experiments were conducted on a variety of models. They show that the assembling loss performs better than either of its component losses used alone, reducing the EER by more than 10% on average.

Finally, the proposed methods are integrated into a unified model, and experiments are conducted to compare it with current state-of-the-art speaker recognition models on the VoxCeleb-1 test set. Among models trained only on the VoxCeleb-1 training set, our model achieves an EER of 2.53% and a minDCF of 0.284, the best results among the models compared for that training set. Among models trained only on the VoxCeleb-2 training set, our model achieves an EER of 1.43% and a minDCF of 0.17. Its EER is comparable to that of the current state-of-the-art model trained on the VoxCeleb-2 training set, and its minDCF is the best. Our model has 1.9M parameters, while the comparison model has 13M, about 6.8 times as many as the model in this thesis.
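
As an illustration of how a cosine-prototypical loss and a margin-based Softmax loss might be combined, the following is a minimal NumPy sketch and not the thesis implementation. The abstract does not give the exact formulation, so several choices here are assumptions: the additive-margin form of the Softmax loss, the `margin` and `scale` values, the plain weighted sum with weight `alpha`, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def margin_softmax_loss(emb, labels, class_weights, margin=0.2, scale=30.0):
    """Additive-margin Softmax (assumed variant): subtract a margin from the
    target-class cosine so that classes are pushed further apart."""
    cos = l2_normalize(emb) @ l2_normalize(class_weights).T  # (N, C)
    cos[np.arange(len(labels)), labels] -= margin
    return cross_entropy(scale * cos, labels)

def cosine_prototypical_loss(support, query, scale=10.0):
    """Few-shot style loss: each query should be closest (in cosine) to the
    prototype (mean support embedding) of its own speaker.
    support: (S, K, D), K support utterances for each of S speakers.
    query:   (S, D), one query utterance per speaker, in the same order."""
    prototypes = support.mean(axis=1)                        # (S, D)
    cos = l2_normalize(query) @ l2_normalize(prototypes).T   # (S, S)
    return cross_entropy(scale * cos, np.arange(len(query)))

def assembling_loss(support, query, labels, class_weights, alpha=1.0):
    """Assumed combination: prototypical term on the episode plus a weighted
    margin-Softmax term on the query embeddings (labels are global speaker ids)."""
    proto = cosine_prototypical_loss(support, query)
    margin = margin_softmax_loss(query, labels, class_weights)
    return proto + alpha * margin
```

In this sketch the prototypical term only sees the speakers in the current episode, while the margin-based term classifies against all training speakers through `class_weights`, which is one way to read the "complementary advantages" described above.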
Keywords/Search Tags: In the wild, Speaker recognition, Multi-scale aggregation, Assembling loss, Few-shot learning