Triplet Loss And Manifold Dimensionality Reduction Based Method For Text-independent Speaker Recognition

Posted on: 2020-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: C M Liu
GTID: 2428330590473215
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, protecting personal information privacy has drawn increasing attention, and biometric authentication is becoming more and more popular. Speaker recognition, as a core authentication technology, is widely used in judicial settings, access control, smartphone wake-up and other fields. Compared with fingerprint, face and iris recognition, speaker recognition places lower requirements on input equipment and can be implemented with an embedded microphone, so it is easy to deploy in real-life scenarios. Speaker recognition also plays an important role in national security prevention and control, for example in suspect recognition based on telephone voice.

Identity vector (i-vector) based speaker recognition is one of the mainstream methods in the field. However, its model training steps are complex, and a different objective function is optimized at each stage, so errors produced in one stage cannot be corrected in the next. In addition, the supervector obtained by the i-vector method is high-dimensional, which increases the amount of computation. Recently, end-to-end neural network methods based on Triplet Loss model the speaker with a single objective function, which avoids the independent per-stage optimization of i-vector and yields a lower-dimensional supervector that significantly reduces the computational load. Moreover, the idea behind the Triplet Loss function coincides with the goal of speaker recognition: to shorten the distance between samples of the same class and lengthen the distance between samples of different classes.

In the field of end-to-end speaker recognition, Google proposed the Generalized End-to-End Loss (GE2E) for text-dependent speaker verification. This method aims at optimizing intra-class distance, but its training efficiency is much lower than that of Triplet Loss. To this end, drawing on the GE2E idea, this paper studies an improved end-to-end speaker recognition method based on intra-class distance constraints. To reduce the divergence of speakers, a manifold-learning-based method, the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, is used to compensate for channel distortion in the speaker embeddings produced by the neural network. Considering that some frames in real speech are relatively clean and more helpful for speaker recognition, this paper adopts an Attention Weighted Pooling method to enhance the model's robustness to noise. The proposed speaker recognition method, based on Triplet Loss end-to-end feature embedding and t-SNE channel compensation, significantly improves recognition performance on the VoxCeleb1 dataset compared with the baseline systems.
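
The two ideas named in the abstract, Triplet Loss and attention-weighted pooling, can be illustrated with a minimal sketch. This is not the thesis's implementation: the squared Euclidean distance, the margin of 0.2, the function names, and the scoring vector `w` are all assumptions chosen for illustration, and a real system would compute these quantities inside a neural network framework.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the different-speaker embedding is at least
    `margin` farther from the anchor than the same-speaker embedding.
    The margin value is an illustrative assumption.
    """
    d_pos = squared_distance(anchor, positive)  # same-speaker distance
    d_neg = squared_distance(anchor, negative)  # different-speaker distance
    return max(d_pos - d_neg + margin, 0.0)


def attention_weighted_pooling(frames, w):
    """Pool frame-level features into a single utterance-level vector.

    `frames` is a list of T feature vectors; `w` is a hypothetical
    scoring vector of the same dimension. Each frame gets a softmax
    weight, so higher-scoring frames contribute more to the result.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [
        sum(wt * frame[i] for wt, frame in zip(weights, frames))
        for i in range(dim)
    ]
```

With an all-zero `w`, every frame receives the same weight and the pooling reduces to a plain average over frames; a trained scoring vector would instead down-weight noisy frames.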
Keywords/Search Tags: speaker recognition, triplet loss, t-distributed stochastic neighbor embedding, attention weighted pooling