
Research On End-to-end Speaker Recognition Based On Raw Waveform

Posted on: 2021-04-01    Degree: Master    Type: Thesis
Country: China    Candidate: N X Liang    Full Text: PDF
GTID: 2428330611466497    Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of science and technology, society's demand for information security keeps growing in the era of network information. How to identify an individual accurately and better protect personal information is a key problem of the intelligent era. Biometrics, the technology of authenticating personal identity using human physiological and behavioral characteristics, has attracted increasing attention because of its convenience, security and efficiency. Among biometric technologies, speaker recognition, also known as voiceprint recognition, is one of the popular research directions. Because voiceprint features are stable, unique and easy to collect, speaker recognition is widely used in human-computer interaction, identity authentication and other practical applications. The introduction of deep learning has further advanced speaker recognition, and end-to-end speaker recognition based on neural networks has attracted many researchers. However, current end-to-end speaker recognition systems generally follow a "divide and conquer" process: traditional speech features, such as Mel-Frequency Cepstral Coefficients, are first extracted from the original speech signal, and speakers are then classified on top of these features. This approach usually relies on manually designed, fixed and complex acoustic features, and the extraction of these features and the training of the speaker recognition model are conducted separately rather than optimized as a whole, which makes it difficult to jointly optimize speech feature extraction and speaker classification. Designing a proper mechanism for the cooperation of the acoustic feature extractor and the classifier is therefore an essential yet challenging task. To address this, this paper proposes a new end-to-end speaker recognition framework that applies a time-domain convolutional feature extractor directly to the original audio signal and builds a DNN speaker recognition model on top of it. The main work of this paper is as follows:

Firstly, this paper proposes a new end-to-end speaker recognition framework consisting of an acoustic feature extractor, a DNN classifier, and an AM-Softmax and Triplet loss mechanism. The framework jointly optimizes speech feature extraction and speaker classification, and achieves stable and accurate speaker recognition.

Secondly, a new speech feature extraction method based on time-domain convolution is proposed. It learns and extracts an effective raw-front feature directly from the original time-domain signal and can be embedded in the speaker recognition system to replace traditional fixed speech features, improving the accuracy and robustness of extracting speech features directly from time-domain signals.

Thirdly, to further explore the raw-front feature and its effectiveness, this paper builds several text-independent speaker recognition systems that take the raw-front feature as input and use different deep neural networks as classification models, and discusses the application of the time-domain convolutional feature extraction method in end-to-end speaker recognition. Comparison and analysis of extensive experimental results on the open-source datasets CSTR VCTK Corpus and TIMIT show that the proposed end-to-end speaker recognition framework achieves equal error rates (EER) of 1.93% and 9.61% respectively, which represents good speaker recognition accuracy. Compared with traditional speech feature methods, the raw-front feature proposed in this paper obtains a lower EER under a fixed classifier model configuration.
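To make the framework described above concrete, the following is a minimal PyTorch sketch, not the thesis's exact architecture: a learnable time-domain 1-D convolution acts as the raw-waveform front end, a small DNN produces a speaker embedding, and an additive-margin softmax (AM-Softmax) loss classifies speakers. All layer sizes, the kernel/stride values, the margin and scale, and the speaker count are illustrative assumptions; the thesis also uses a Triplet loss, which is omitted here for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeConvFrontEnd(nn.Module):
    """Learnable time-domain convolution replacing hand-crafted features such as MFCC."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            # ~25 ms window, 10 ms hop at 16 kHz (assumed sampling rate)
            nn.Conv1d(1, out_channels, kernel_size=400, stride=160),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )

    def forward(self, waveform):            # waveform: (batch, samples)
        x = waveform.unsqueeze(1)           # -> (batch, 1, samples)
        return self.conv(x)                 # -> (batch, channels, frames)

class SpeakerDNN(nn.Module):
    """DNN classifier producing a fixed-length speaker embedding from the raw-front feature."""
    def __init__(self, in_channels=64, emb_dim=256, num_speakers=109):
        super().__init__()
        self.frontend = TimeConvFrontEnd(in_channels)
        self.dnn = nn.Sequential(
            nn.Linear(in_channels, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )
        # Class weight matrix used by the AM-Softmax loss (normalized at loss time).
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))

    def forward(self, waveform):
        feats = self.frontend(waveform).mean(dim=2)   # temporal average pooling
        return self.dnn(feats)                        # speaker embedding

def am_softmax_loss(embeddings, labels, weight, margin=0.35, scale=30.0):
    """Additive-margin softmax: subtract a margin from the target-class cosine score."""
    emb = F.normalize(embeddings, dim=1)
    w = F.normalize(weight, dim=1)
    cos = emb @ w.t()                                  # cosine similarity to each speaker class
    target = F.one_hot(labels, cos.size(1)).float()
    logits = scale * (cos - margin * target)
    return F.cross_entropy(logits, labels)

# Toy usage with a batch of 1-second utterances at 16 kHz (illustrative only).
model = SpeakerDNN()
wav = torch.randn(8, 16000)
labels = torch.randint(0, 109, (8,))
loss = am_softmax_loss(model(wav), labels, model.weight)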
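The reported 1.93% and 9.61% figures are equal error rates. As a reference, a short sketch of how EER is typically computed from verification trial scores is shown below; the scores and labels are toy placeholders, not data from the thesis.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate equals the false-rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same speaker, 0 = different speaker
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Example with toy cosine-similarity scores for six verification trials.
scores = np.array([0.82, 0.15, 0.64, 0.07, 0.91, 0.30])
labels = np.array([1, 0, 1, 0, 1, 0])
print(f"EER = {equal_error_rate(scores, labels):.2%}")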
Keywords/Search Tags:Speaker recognition, Acoustic feature, Deep Neural Network, Time convolution