Research On SincNet And Siamese LSTM Based Method For Speaker Verification

Posted on: 2021-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Yihenew Alemu Haile
Full Text: PDF
GTID: 2428330611999373
Subject: Computer Science and Technology
Abstract/Summary:
Speaker verification is the task of confirming a speaker's claimed identity from their voice. Current methods based on deep neural networks (DNNs) achieve strong performance, but several problems remain. First, front-end feature extraction is not sufficiently interpretable. Second, temporal information is not fully exploited when back-end embeddings are extracted. Third, the vanishing gradient problem must be addressed. This thesis explores effective methods for solving these problems. The main research contents and contributions are summarized as follows:

(1) We propose a framework based on SincNet and long short-term memory (LSTM) that extracts embeddings carrying more interpretable and more temporal information. Specifically, the front-end SincNet uses the sinc function to define the response characteristics of its filters. The back-end LSTM, trained with a softmax loss, learns the characteristics of vocal tract sound production in order to identify the speaker. The LSTM also mitigates the vanishing gradient problem by preserving the error signal during backpropagation. In addition, the proposed framework is an end-to-end system that maps raw waveforms directly to embeddings. Experimental results show that the proposed framework outperforms the baseline methods.

(2) We also propose an improved framework based on SincNet and a Siamese LSTM architecture. To avoid confusing features of the same speaker with those of different speakers, the Siamese network consists of two identical sub-networks with the same configuration and shared parameters. Under the contrastive loss used to train the Siamese network, pairs of utterances from the same speaker are mapped closer together, and pairs from different speakers are mapped farther apart. The contrastive loss takes the network outputs for a pair of utterances, computes the distance between them, and contrasts the distance for same-speaker pairs with the distance for different-speaker pairs. Experimental results show that the improved framework outperforms both the first proposed framework and the other baseline methods.
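
The abstract does not give the filter parameterization, so the following is only a minimal sketch of how a sinc function can define a filter's response. It assumes the band-pass form commonly associated with SincNet (the difference of two low-pass sinc filters, set by a low and a high cutoff frequency) and a Hamming window. The function names, the 16 kHz sample rate, and the kernel size of 251 are illustrative assumptions, not values taken from the thesis.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_bandpass(f_low_hz, f_high_hz, sample_rate=16000, kernel_size=251):
    """Windowed band-pass FIR kernel described only by its two cutoffs.

    Assumed parameterization (not stated in the abstract); kernel_size
    should be odd so the kernel has a centre tap.
    """
    f1 = f_low_hz / sample_rate   # cutoffs in cycles per sample
    f2 = f_high_hz / sample_rate
    half = (kernel_size - 1) / 2
    kernel = []
    for i in range(kernel_size):
        n = i - half  # time index centred on zero
        # Ideal band-pass = high-cutoff low-pass minus low-cutoff low-pass.
        ideal = 2 * f2 * sinc(2 * f2 * n) - 2 * f1 * sinc(2 * f1 * n)
        # Hamming window reduces the ripple caused by truncating the sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel

def gain_at(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * i) for i, h in enumerate(kernel))
    im = sum(h * math.sin(w * i) for i, h in enumerate(kernel))
    return math.hypot(re, im)

kernel = sinc_bandpass(300.0, 3400.0)
print(gain_at(kernel, 1000.0))  # inside the pass band
print(gain_at(kernel, 6000.0))  # outside the pass band
```

Under this assumption each filter is described by two numbers, its cutoff frequencies, which is one way to read the interpretability claim above: the band each filter passes can be read off directly.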
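
The abstract names a contrastive loss but not its exact form. As a hedged illustration, the sketch below uses a widely used margin-based formulation with Euclidean distance: same-speaker pairs are penalized by their squared distance, and different-speaker pairs are penalized only when they are closer than a margin. The function names, the margin value, and the toy embeddings are assumptions for illustration and are not taken from the thesis.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumed formulation (not confirmed by the abstract):
      same speaker      -> d^2               (pulls the pair together)
      different speaker -> max(0, m - d)^2   (pushes the pair apart until
                                              the distance reaches margin m)
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy embeddings standing in for the outputs of the two shared-weight branches.
anchor = [0.0, 0.0]
close = [0.1, 0.0]
far = [3.0, 4.0]

print(contrastive_loss(anchor, close, same_speaker=True))   # small penalty
print(contrastive_loss(anchor, close, same_speaker=False))  # penalized: too close
print(contrastive_loss(anchor, far, same_speaker=False))    # no penalty: beyond margin
```

In a Siamese setup both embeddings come from the same network with shared parameters, so minimizing a loss of this kind shapes a single embedding space in which distance reflects speaker identity.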
Keywords/Search Tags: speaker verification, DNN, SincNet, LSTM, Siamese network