End To End Speaker Recognition Under Noisy Environment

Posted on: 2021-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Zeng
GTID: 2518306128976699
Subject: Master of Engineering
Abstract/Summary:
Speaker recognition, also called voiceprint recognition, determines a speaker's identity by analyzing one or more speech signals and extracting their characteristics. Alongside fingerprint, face, and iris recognition, it is an identity verification technology with broad application scenarios. Because it is widely applicable, convenient, accurate, and requires no physical contact, it has become an important research topic in the speech field. Since the mid-1990s, and especially after the Gaussian mixture model was applied to the task, speaker recognition has continued to attract researchers' attention and has improved considerably. Current systems achieve high recognition accuracy on clean speech. In practical deployments, however, there is still much room for improvement in robustness, transferability, and phrase recognition rate. The noises common in real-life scenes are among the main factors that degrade speaker model performance, so improving speaker recognition in real noisy environments has become one of the most important research questions in the field.

The main contents of this dissertation are as follows:

(1) It introduces the basics of speaker recognition and the main technical difficulties currently faced, analyzes the advantages and disadvantages of common speaker recognition algorithms, and runs a preliminary experimental comparison of a baseline i-vector model and an LSTM model on part of the open-source AISHELL data.

(2) It studies the calculation of the speech signal-to-noise ratio (SNR) and the detection of amplitude clipping, and proposes methods for both. Batch SNR calculation is implemented in C++, and a batch audio clipping detection tool is implemented in Python. During training, clipping detection filters out clipped audio to improve the quality of the training data. In practical application scenarios, SNR calculation is used to grade the quality of the input audio and noise reduction is used to improve it, which in turn improves the performance of the speaker recognition system.

(3) In contrast to traditional speaker recognition techniques (such as GMM-UBM, JFA, and i-vector), it focuses on end-to-end speaker recognition within a deep learning framework. End-to-end CNN-LSTM and ResNet-LSTM fusion network models are designed, and comparative experiments are run on data sets with different signal-to-noise ratios. The results show that the proposed model achieves better recognition performance than the basic CNN-LSTM speaker recognition model on two open speech data sets, and further indicate that replacing the convolutional network with a deeper residual network extracts speaker spectral features more effectively. The network is also improved by replacing the softmax cross-entropy loss of the original structure with Triplet Loss and GE2E Loss. Experiments show that GE2E Loss further improves the recognition performance of the current network model.
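
The abstract does not describe how the clipping detector or the SNR calculation work internally, so the following Python sketch is only an illustration of the two ideas, not the dissertation's tools. It assumes samples are floats normalized to [-1, 1], treats a sample as clipped when its magnitude is within a small tolerance of full scale, and computes SNR as 10*log10(P_signal / P_noise) from separately available signal and noise segments. The function names, the tolerance, and the `max_ratio` threshold are all hypothetical.

```python
import math


def clipping_ratio(samples, full_scale=1.0, tolerance=1e-3):
    """Return the fraction of samples at (or within tolerance of) full scale.

    Assumption: samples are floats normalized to [-full_scale, full_scale].
    """
    if not samples:
        return 0.0
    threshold = full_scale * (1.0 - tolerance)
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)


def is_clipped(samples, max_ratio=0.001, full_scale=1.0, tolerance=1e-3):
    """Flag audio whose clipped fraction exceeds max_ratio (hypothetical threshold)."""
    return clipping_ratio(samples, full_scale, tolerance) > max_ratio


def snr_db(signal, noise):
    """Return the SNR in dB as 10*log10(mean signal power / mean noise power).

    Assumption: separate signal and noise segments are available. Real
    recordings would first need a noise estimate, e.g. from non-speech frames.
    """
    if not signal or not noise:
        raise ValueError("signal and noise must be non-empty")
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return math.inf
    if p_signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_signal / p_noise)


def filter_unclipped(utterances, max_ratio=0.001):
    """Keep only the utterances (lists of samples) that are not flagged as clipped."""
    return [u for u in utterances if not is_clipped(u, max_ratio=max_ratio)]
```

In a training pipeline, something like `filter_unclipped` would be applied to the corpus before feature extraction, while `snr_db` would be used at inference time to sort incoming audio into quality tiers.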
Keywords/Search Tags:speaker recognition, audio signal-to-noise ratio, CNN, LSTM