
Research On Deep Learning Based Speaker Recognition Algorithm

Posted on: 2021-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: T. Y. Bian
Full Text: PDF
GTID: 2428330623484144
Subject: Control theory and control engineering

Abstract
With the application and popularization of intelligent terminal devices, biometric recognition technology has come to play an increasingly important role in authentication scenarios owing to its convenience. Speaker recognition is a form of biometric recognition that identifies people by their voice signals, and it is widely used in scenarios such as criminal investigation, financial risk control, and human-computer interaction on voice terminal devices. By application scenario, it divides into two tasks: speaker verification and speaker identification. Depending on whether the spoken content is restricted, speaker recognition is further divided into two categories: text-dependent and text-independent. This thesis focuses on the more challenging text-independent setting and evaluates both the speaker verification and the speaker identification task.

This thesis proposes an end-to-end speaker recognition paradigm, comprising neural network models based on the attention mechanism and model training methods based on metric learning. The proposed network combines a residual convolutional neural network with the attention mechanism: it applies attention to high-level feature extraction, and it also introduces an attention-based temporal pooling method that learns to weight the features of different speech segments adaptively. Building on triplet loss, the thesis proposes a novel online hard-sample mining method that unifies the constraints on sample pairs from the same speaker, together with a stable training scheme that addresses the notoriously difficult optimization of triplet loss. Trained on the VoxCeleb1 dataset, the proposed scheme achieves an equal error rate of 5.3% on speaker verification, surpassing the popular i-vector and x-vector models. Moreover, the scheme is end-to-end and requires no separate scoring back-end, whereas both the i-vector and x-vector models rely on a separately trained PLDA model for scoring. When trained on the VoxCeleb2 dataset, the scheme reduces the equal error rate on the VoxCeleb1 test set to 4.05%, outperforming ResNet-34 and ResNet-50 trained with contrastive loss, while the complexity of the proposed network is much lower than that of ResNet-34.

For general multi-class classification tasks, the thesis proposes a training paradigm that combines a metric-learning loss function with softmax cross-entropy: the bottleneck features of the network are trained with the CRL loss function described in the thesis, and the final fully connected classifier layer is trained with softmax cross-entropy. These two steps can be performed simultaneously by cutting off gradient propagation between the bottleneck features and the classification layer. On the VoxCeleb1 dataset, this method further improves Top-1 accuracy by 3.6%.
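The attention-based temporal pooling described in the abstract can be sketched in plain NumPy. The tanh scoring function and the parameter names `w`, `b`, `v` are illustrative assumptions, not the exact architecture from the thesis; the point is that a learned scalar score per frame is softmax-normalized over time and used to weight the frame features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(frames, w, b, v):
    """Collapse frame-level features into one utterance embedding.

    frames: (T, D) frame-level features from the convolutional front end.
    w, b, v: attention parameters (hypothetical shapes (D, A), (A,), (A,)).
    Returns the (D,) weighted embedding and the (T,) attention weights.
    """
    h = np.tanh(frames @ w + b)   # (T, A) hidden attention representation
    scores = h @ v                # (T,) one scalar score per frame
    alpha = softmax(scores)       # attention weights over time, sum to 1
    return alpha @ frames, alpha  # adaptively weighted temporal average
```

Frames that receive higher attention scores contribute more to the utterance embedding, which is what lets the model down-weight silence or noisy segments.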
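The abstract does not spell out the exact online hard-sample mining rule, so as a stand-in the widely used "batch-hard" variant of triplet mining is sketched below: within each mini-batch, every anchor is paired with its farthest same-speaker sample and its closest different-speaker sample. This is an illustration of online mining in general, not the thesis's specific method.

```python
import numpy as np

def pairwise_sq_dist(emb):
    """Squared Euclidean distances between all embeddings in a batch."""
    sq = (emb ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clamp tiny negatives from rounding

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """Triplet loss with online hard mining inside one mini-batch.

    emb: (N, D) embeddings; labels: (N,) speaker ids. Assumes every
    sample has at least one positive and one negative in the batch.
    """
    d = pairwise_sq_dist(emb)
    same = labels[:, None] == labels[None, :]
    # hardest positive: farthest sample of the same speaker
    hardest_pos = np.where(same, d, -np.inf).max(axis=1)
    # hardest negative: closest sample of a different speaker
    hardest_neg = np.where(~same, d, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

When same-speaker embeddings cluster tightly and different speakers are well separated, the hinge is inactive and the loss is zero; degenerate embeddings are penalized by the margin.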
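The gradient cut between the bottleneck features and the classifier layer can be sketched as follows: the embeddings enter the classifier update as constants (the NumPy analogue of `detach()` in an autograd framework), so only the classifier weights `W` receive softmax cross-entropy gradients while the metric-learning loss shapes the features. Names, sizes, and the learning rate are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true classes."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def classifier_step(features, labels, W, lr=0.5):
    """One softmax cross-entropy update of the classifier weights W.

    `features` is treated as a constant (gradient propagation is cut),
    so no gradient flows back into the network that produced it.
    """
    probs = softmax(features @ W)
    onehot = np.eye(W.shape[1])[labels]
    grad_W = features.T @ (probs - onehot) / len(labels)  # dL/dW only
    return W - lr * grad_W
```

Because the classifier sees the features as fixed inputs, the two training objectives can run simultaneously without interfering with each other's gradients.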
Keywords: Speaker Recognition, Deep Learning, Attention Mechanism, Triplet Loss