
Research Of Speaker Recognition Technology Based On Kaldi

Posted on: 2022-03-12  Degree: Master  Type: Thesis
Country: China  Candidate: L P Yue  Full Text: PDF
GTID: 2518306515466894  Subject: Software engineering
Speaker recognition, also known as voiceprint recognition, is a biometric technology that identifies a speaker from speech. Speech carries rich information: common information that reflects the spoken content, and distinguishing information that reflects the individual characteristics of the speaker. Speaker recognition focuses on the distinguishing information in order to establish the speaker's identity. Compared with other recognition technologies such as facial recognition, speaker recognition is convenient to operate, low in cost, and high in recognition accuracy. It is already applied in industries such as finance and the military, and its application prospects are promising.

This thesis jointly considers the properties of various acoustic features, modeling methods, and scoring strategies. Using the Kaldi speech recognition toolkit together with signal processing theory and techniques, it combines the strengths and weaknesses of different acoustic features and acoustic models to evaluate test speech data. First, feature extraction and feature processing are used to dynamically fuse different acoustic features so that they complement one another, reducing the influence of noise and other extraneous information on the speaker features and forming new input features. Second, theories and techniques from deep learning and natural language processing are used to refine the features and further reduce the influence of noise and other extraneous information on the speaker features. Finally, using the scoring strategies of speaker recognition, the features are fed into speaker recognition models such as i-vector and x-vector, combined with cost functions and activation functions, and evaluated on test speech data. By first filtering out speaker-independent characteristics and then identifying speaker-related characteristics, speaker recognition is carried out in stages.

The main research contents of this thesis are as follows:

(1) A speaker recognition algorithm based on multi-feature i-vectors implements the first stage of detection on the test speech. First, the Kaldi toolkit is used to extract different acoustic features from the TIMIT corpus and construct a high-dimensional feature vector. Then, principal component analysis (PCA) is used to remove the correlation within the high-dimensional feature vectors and make the features orthogonal. Finally, probabilistic linear discriminant analysis (PLDA) is used for modeling and scoring, which also reduces the dimensionality of the space to some extent. The equal error rate (EER) is used to evaluate the overall performance of the speaker recognition system.

(2) An x-vector speaker recognition algorithm based on multi-feature and multi-task learning implements the second stage of detection on the test speech. First, the Kaldi toolkit is used to extract complementary acoustic features at different scales from the VoxCeleb1 corpus and feed them into the network simultaneously. Then, the complementary features are fused inside the network: the features flowing into the network are concatenated side by side in a fully connected splicing layer. Finally, an attention mechanism is used to compute frame-level weights, and the rectified linear unit (ReLU) is used to mitigate vanishing gradients. EER and the detection cost function (DCF) are used to evaluate the overall performance of the speaker recognition system.

(3) A speaker recognition algorithm based on spectrograms and a multi-head attention mechanism implements the final stage of detection. First, the Kaldi toolkit is used to extract spectrogram and MFCC acoustic features from the VoxCeleb2 corpus, and the two acoustic features are fed into a TDNN and a CNN in turn. Then, the spectrogram is processed by exploiting the image processing strengths of the CNN. Finally, a multi-head attention mechanism is used to weight the features after network processing, and EER and DCF are used to evaluate the overall performance of the speaker recognition system.

The first-stage algorithm is validated on the TIMIT evaluation set. Compared with the single-feature i-vector model, the best EER is reduced by 90.0% relative (8.33% to 0.833%); in the gender-dependent models, the best EER for male and female speakers is reduced by 85.6% relative (11.67% to 1.38%) and 92.3% relative (9.72% to 0.69%), respectively. The second-stage algorithm is validated on the VoxCeleb1 evaluation set. Compared with the x-vector baseline model, adding the attention mechanism to the statistics layer reduces the best EER by 24.4% relative (2.01% to 1.52%); further introducing multi-task learning and the splicing layer reduces the best EER by 29.0% relative (1.38% to 0.98%). The final-stage algorithm is validated on the VoxCeleb2 evaluation set. Compared with the x-vector baseline model, combining the spectrogram with the CNN reduces the best EER by 6.69% relative (6.58% to 6.14%); further introducing the multi-head attention mechanism reduces the best EER by 26.14% relative (6.58% to 4.36%).
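
The abstract reports results as an equal error rate and as a relative reduction of that rate. The following is a minimal sketch of how these two quantities are commonly computed, not the thesis's own evaluation code. The function names `equal_error_rate` and `relative_reduction` are hypothetical, and the threshold sweep is a simple approximation that takes the operating point where the false-accept and false-reject rates are closest.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the false-accept rate
    and the false-reject rate are closest to each other.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False-reject rate: fraction of same-speaker trials scored below threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False-accept rate: fraction of different-speaker trials at or above threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


def relative_reduction(baseline, improved):
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


# Toy scores (illustrative only, not data from the thesis).
eer, threshold = equal_error_rate([1.0, 3.0], [0.0, 2.0])

# Two of the EER pairs quoted in the abstract.
stage1 = relative_reduction(8.33, 0.833)
stage2 = relative_reduction(2.01, 1.52)
```

With this definition, the pair 8.33% to 0.833% corresponds to a 90.0% relative reduction, and 2.01% to 1.52% to roughly 24.4%.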
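
The first stage describes concatenating several acoustic features into one high-dimensional vector and applying principal component analysis to remove the correlation between them. A minimal sketch of that idea follows, under stated assumptions: the feature matrices `feat_a` and `feat_b` are random stand-ins rather than Kaldi output, the dimensions are arbitrary, and `pca_decorrelate` is a hypothetical helper, not a function from Kaldi or from the thesis.

```python
import numpy as np


def pca_decorrelate(features, n_components):
    """Project row-wise feature vectors onto their top principal components.

    Returns (projected, components, mean). The columns of `projected`
    are mutually uncorrelated by construction.
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    # SVD of the centered data; the rows of vt are the principal directions,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean


# Stand-in data: two deliberately correlated "feature types" per frame.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(200, 13))
feat_b = 0.8 * feat_a + 0.2 * rng.normal(size=(200, 13))

# Concatenate into one high-dimensional vector per frame, then decorrelate.
stacked = np.hstack([feat_a, feat_b])
projected, components, mean = pca_decorrelate(stacked, n_components=10)

# Off-diagonal covariance of the projected features should be ~0.
cov = np.cov(projected, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
```

In this sketch the projected features would then be passed on to i-vector extraction and PLDA scoring, which are not shown here.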
Keywords: Speaker recognition, Attention mechanism, Multi-task learning, Kaldi framework, Multi-feature fusion