
Research On Speaker Representation Based On MG Training Criteria

Posted on: 2022-12-03  Degree: Master  Type: Thesis
Country: China  Candidate: J Han  Full Text: PDF
GTID: 2518306746451894  Subject: Computer technology
Abstract/Summary:
Traditional speaker recognition systems mostly use the Mel-Frequency Cepstral Coefficient (MFCC) feature and the Gaussian Mixture Model (GMM) framework, while frameworks based on i-vectors, deep neural networks, and related technologies have become increasingly popular. Although deep learning methods have shown good recognition performance in speaker recognition, deep embedding approaches have a potential problem: the training objective is purely discriminative, which means it only pushes each speaker away from the others and does not constrain the distribution of the speaker vectors. This limitation causes two serious problems: (1) for each speaker, the within-class distribution does not conform to a Gaussian distribution; (2) the distributions of vectors from different speakers are non-homogeneous. Such non-Gaussianality and non-homogeneity seriously degrade the performance of back-end scoring models, especially the most popular probabilistic linear discriminant analysis (PLDA) model, because PLDA computes the likelihood ratio of two utterances under a Gaussian assumption, so an unconstrained distribution leads to inaccurate PLDA scores. Therefore, this thesis adopts a Maximum Gaussianality (MG) training approach to extract highly regularized speaker embeddings. The main work is as follows:

Firstly, a speaker recognition baseline system based on x-vectors is built on the widely used VoxCeleb1 corpus. A series of baseline experiments verifies the complexity of the distribution, in high-dimensional space, of the speaker representations extracted by current mainstream speaker recognition systems, and confirms that this distribution is non-Gaussian and non-uniform.

Secondly, the Maximum Gaussianality (MG) training approach is used to regularize the speaker embeddings in high-dimensional space. Vectors drawn from a high-dimensional Gaussian satisfy two properties: (1) the lengths of most samples concentrate on a thin annulus; (2) any two samples tend to be orthogonal. Since these two properties are necessary conditions of a high-dimensional Gaussian, maximizing the length metric and the angle metric directly optimizes the Gaussianality of the latent codes. Compared with the baseline system, the equal error rate (EER) of the back-end scoring is reduced by 1.2%~6%. The ability of the MG approach to normalize the distribution of high-dimensional speaker vectors delivers substantial performance gains and shows strong regularization power in the speaker recognition baseline system.

Finally, the PLDA scoring method is optimized to improve scoring performance. When the number of classes in the training set is limited, the traditional PLDA scoring method cannot reliably estimate the between-class variance for all speakers, which degrades the performance of the back-end scoring system. Therefore, this thesis modifies the traditional PLDA scoring method by introducing an additional Inverse-Wishart prior and uses a PLDA scoring method based on maximum a posteriori (MAP) estimation to improve the back-end scoring performance of the whole system; the equal error rate (EER) of back-end scoring is reduced by 0.2%~1%.
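The abstract does not spell out the MG criterion itself, so the following is only an illustrative sketch of a regularizer built from the two properties listed above: a length term that penalizes the spread of embedding norms and an angle term that penalizes non-zero pairwise cosines. The function name `mg_regularizer`, the use of PyTorch, and the specific penalty forms are assumptions for illustration, not the thesis's formulation.

```python
import torch
import torch.nn.functional as F


def mg_regularizer(embeddings: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative MG-style regularizer for a batch of speaker embeddings (B, D).

    Encourages two necessary properties of a high-dimensional Gaussian:
      (1) sample lengths concentrate on a thin annulus (low variance of norms);
      (2) any two samples tend to be orthogonal (pairwise cosines near zero).
    """
    # (1) Length metric: penalize the spread of the embedding norms.
    norms = torch.linalg.vector_norm(embeddings, dim=1)
    length_loss = norms.var()

    # (2) Angle metric: penalize non-zero cosine similarities between
    #     distinct samples in the batch.
    unit = F.normalize(embeddings, dim=1, eps=eps)
    cos = unit @ unit.t()                          # (B, B) cosine matrix
    off_diag = cos - torch.diag(torch.diag(cos))   # zero out the diagonal
    angle_loss = (off_diag ** 2).sum() / (cos.numel() - cos.size(0))

    return length_loss + angle_loss


# Example use (hypothetical): add the regularizer to the usual
# discriminative speaker-classification loss.
# emb = network(features)                                  # (B, D) embeddings
# loss = ce_loss(logits, labels) + 0.1 * mg_regularizer(emb)
```

In such a setup the discriminative loss still separates speakers, while the added terms pull the batch of embeddings toward the annulus-concentration and near-orthogonality behavior expected of a high-dimensional Gaussian.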
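For context on why the Gaussian assumption matters for back-end scoring, the textbook two-covariance PLDA likelihood ratio can be written as below; this is the standard form only, and the thesis's modified scoring with an Inverse-Wishart prior and MAP estimation is not reproduced here.

```latex
% Two-covariance PLDA: speaker variable y ~ N(mu, Sigma_b),
% observation x given y ~ N(y, Sigma_w).
\[
  \operatorname{score}(x_1, x_2)
    = \log \frac{p(x_1, x_2 \mid \mathcal{H}_s)}
                {p(x_1 \mid \mathcal{H}_d)\, p(x_2 \mid \mathcal{H}_d)},
  \qquad
  p(x_1, x_2 \mid \mathcal{H}_s)
    = \int \mathcal{N}(x_1; y, \Sigma_w)\,
           \mathcal{N}(x_2; y, \Sigma_w)\,
           \mathcal{N}(y; \mu, \Sigma_b)\, \mathrm{d}y .
\]
```

Here \(\mathcal{H}_s\) and \(\mathcal{H}_d\) denote the same-speaker and different-speaker hypotheses; every density in the ratio is Gaussian, so embeddings whose actual distribution is far from Gaussian make the likelihood ratio unreliable, which is the degradation the MG training is meant to correct.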
Keywords/Search Tags:Speaker Recognition, Speaker Embedding, Gaussian Distribution, Normalization, Backend Scoring