
Speaker Recognition Based On Multi-information Fusion

Posted on: 2019-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Fang
GTID: 2428330542992462
Subject: Pattern Recognition and Intelligent Systems

Abstract/Summary:
Speaker recognition, also called voiceprint recognition, is a biometric identification technology that automatically identifies a speaker from his or her voice. In essence, it is a classification process based on speaker features. This thesis therefore aims to extract more comprehensive features that characterize speaker information, in order to further improve the performance of speaker recognition systems. The main contents are as follows:

1. Three traditional speaker recognition systems are built, distinguished by their training models. The first is based on the total variability model (TVM): large-scale data is used to train a universal background model (UBM), statistics of the subspace data are then computed, and the TVM is trained based on the frame posterior probabilities. This system is named TVM-I-Vector. The second is based on a deep neural network (DNN): the DNN replaces the UBM of the TVM-I-Vector system and computes the posterior of each frame with respect to each class in the model. This system is abbreviated as NN-I-Vector. The third uses deep bottleneck features (DBF) in place of acoustic features such as MFCC as the input of the speaker recognition system. This system is denoted DBF-I-Vector. Because i-vectors are extracted without distinguishing speaker information from channel information in the input utterances, LDA or PLDA is applied to reduce the influence of the channel on recognition performance.

2. A speaker recognition system based on feature fusion is constructed. The input features of a speaker recognition system can be deep features, such as DBF, or shallow features, such as MFCC and PLP. Shallow features are low-level features extracted from short-time spectral information, and they struggle to represent the high-level information in the input speech. Deep features take phoneme-discriminative information into account but do not capture the intuitive, physical-layer acoustic characteristics. Given these complementary strengths and weaknesses, feature fusion is applied to combine the advantages of both kinds of feature and improve the performance of the speaker recognition system.

3. A speaker recognition system based on i-vector model fusion is implemented. Different types of speaker recognition system, such as TVM-I-Vector and NN-I-Vector, differ in performance, and each has its own advantages. These differences ultimately show up in the extracted feature vectors, the i-vectors. Model fusion is therefore proposed to exploit the advantages of the different systems and improve the performance of the speaker recognition system.

4. An end-to-end speaker recognition system is built. The basic idea of end-to-end speaker recognition is to replace the i-vector with a speaker embedding extracted from a deep neural network. Specifically, acoustic features are used as input, and fixed-length feature vectors, called speaker embeddings, are extracted from the output of the statistics pooling layer. At the back end of the system, PLDA and cosine similarity are used to score pairs of speaker embeddings. The end-to-end speaker recognition system in this thesis is designed and optimized under the guidance of this idea, which not only reduces the training complexity of the system but also adds discriminative information to it.
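As a rough illustration of three operations mentioned in the abstract (frame-level feature fusion, statistics pooling, and cosine scoring), the following Python sketch may help. It is not the thesis's implementation. The function names are hypothetical, fusion by simple concatenation is an assumption (the abstract does not state the fusion scheme), and pooling is assumed to be the common mean-plus-standard-deviation form.

```python
import math


def fuse_features(deep_frames, shallow_frames):
    """Frame-level fusion by concatenation (assumed scheme, not from the thesis).

    Each argument is a list of per-frame feature vectors, e.g. DBF and MFCC.
    """
    if len(deep_frames) != len(shallow_frames):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(d) + list(s) for d, s in zip(deep_frames, shallow_frames)]


def statistics_pooling(frames):
    """Map a variable-length frame sequence to one fixed-length vector.

    Returns the per-dimension means followed by the per-dimension
    (population) standard deviations, so the output length is 2 * dim.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    stds = [
        math.sqrt(sum((f[i] - means[i]) ** 2 for f in frames) / n)
        for i in range(dim)
    ]
    return means + stds


def cosine_score(a, b):
    """Cosine similarity between two vectors (i-vectors or embeddings).

    Returns 0.0 if either vector has zero norm.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy usage: two frames of a 2-dim "deep" feature and a 1-dim "shallow" feature.
deep = [[1.0, 2.0], [3.0, 4.0]]
shallow = [[0.5], [1.5]]
fused = fuse_features(deep, shallow)   # two frames, three dimensions each
pooled = statistics_pooling(fused)     # one vector of length six
score = cosine_score(pooled, pooled)   # a vector compared with itself
```

In a real system the pooled vector would come from hidden-layer activations of a trained network rather than from raw features, and PLDA scoring would require a trained back-end model, which is omitted here.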
Keywords: speaker recognition, i-vector, deep neural network, model fusion, end-to-end