Research On Speaker Recognition Based On Supervised I-vector Space Learning

Posted on: 2021-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Chen
GTID: 1368330614450826
Subject: Computer Science and Technology
Abstract/Summary:
Speaker recognition is one of the most important technologies in identity recognition research. It is widely used in identity authentication systems and has attracted considerable attention from industry and researchers because it requires no password, involves no physical contact, and is low in cost. After decades of development, speaker recognition has made great progress. Feature extraction, which extracts speaker-specific information from speech signals, is one of its most important tasks. Frame-level features are generally used to characterize speech signals because speech is stable over short time spans. However, more identity information is hidden in the statistical characteristics of longer utterances, so increasing attention has been paid to extracting utterance-level features from sequences of frame-level features. Most utterance-level features rely on learning a feature space, and identity-vector (i-vector) space learning is one of the most widely used of these methods. However, the current i-vector approach still does not make effective use of category information, which contains important prior knowledge.

Based on this analysis, we focus on using category information effectively to supervise the learning of the i-vector space. Our study has two aspects. On the one hand, the category information is used directly: we find the information shared by the data and their labels and introduce it into the i-vector space. On the other hand, the category information is used indirectly: since the back-end classifier can use the labels effectively, we jointly consider the training of the classifier and the learning of the i-vector space, so that the discriminative information obtained while training the classifier supervises the learning of the i-vector space. The main research contents and contributions are summarized as follows:

(1) For the direct use of category information, we introduce the correspondence between the data and their labels by constructing their common subspace. The partial least squares method is used to choose, as the i-vector space, a common subspace that retains as much of the effective, correlated information in the data and their labels as possible. We then propose an i-vector space learning method that learns the correlation between the data and their labels in this common subspace. Because the subspace is learned with supervision, the i-vectors extracted from it are much easier to discriminate. We also design a feature dimension selection method based on the predicted label correlation. The experimental results show that the proposed method effectively improves the performance of the speaker recognition system.

(2) Also for the direct use of category information: when the development set is small, it is difficult to obtain enough information from the development data. To solve this problem, a Gaussian distribution is used as the prior conditional distribution of the data and their labels given the common content, so that the relationship between the data and their labels can be established from this prior information. The probabilistic partial least squares method is then used to learn a common latent variable by maximizing the joint density of the data and their labels. This variable is regarded as the common content, and its associated space is taken as the i-vector space. On this basis, we propose an i-vector space learning method based on the common latent variable representation. Since more prior information is introduced, the i-vectors extracted from this space are easier to discriminate and less affected by small data size. The experimental results show that the proposed method performs better than the other methods when data are insufficient.

(3) For the indirect use of category information: the i-vector space for session compensation (called the SeCo feature space for short) and the back-end classifier are usually learned separately, so the SeCo phase cannot use the discriminative information that the classifier learns from the labels. To solve this problem, we jointly optimize the learning of the SeCo feature space and the classifier, so that the discriminative information is fed back into the learning of the SeCo feature space in a supervised way. Moreover, since dictionary learning with a sparsity constraint can provide the classifier with simpler, linearly separable features and thereby compensate effectively for session variability, we propose a session-independent i-vector space learning method based on the task-driven dictionary learning framework. The experimental results show that the proposed method further improves the discriminability of the session-compensated i-vectors and performs better than other session compensation methods.

(4) Also for the indirect use of category information: the phases of the i-vector approach are learned in a task-segmented way, each with its own independent optimization objective, so no phase other than the classifier can use the discriminative information learned by the classifier. To solve this problem, we learn all the other phases under the supervision of the recognition task (the classifier) and feed the discriminative information back into the learning of these phases, so that all phases are optimized towards a unified recognition task. We then propose an i-vector space learning method based on a task-driven multilevel joint optimization framework, in which each phase is placed in a different layer of the multilevel structure, and we provide a joint solution for this framework. The experimental results show that the proposed method performs better than the i-vector method trained with the task-segmented strategy, as well as other supervised methods.
Keywords/Search Tags:Speaker recognition, i-vector space learning, supervised learning, joint optimization
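
As a rough illustration of the idea in contribution (1), the sketch below computes the first partial least squares direction between a data matrix and one-hot speaker labels. It is a minimal, generic PLS example and not the dissertation's actual algorithm: the function names (`center`, `one_hot`, `first_pls_direction`, `project`) and the toy data are hypothetical, and each row of `X` merely stands in for an utterance-level feature vector. The first PLS weight vector is the dominant left singular vector of the cross-covariance between the centered data and the centered labels, found here by power iteration.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def one_hot(labels):
    """Encode class labels as one-hot rows (classes in sorted order)."""
    classes = sorted(set(labels))
    return [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]


def first_pls_direction(X, labels, iters=100):
    """Unit vector w that maximizes the covariance between Xw and the labels."""
    Xc = center(X)
    Yc = center(one_hot(labels))
    d, c = len(Xc[0]), len(Yc[0])
    # Cross-covariance (up to a 1/n factor): M = Xc^T Yc, shape d x c.
    M = [[sum(x[j] * y[k] for x, y in zip(Xc, Yc)) for k in range(c)]
         for j in range(d)]
    # Start from the largest-norm column of M, then run power iteration on M M^T.
    cols = [[M[j][k] for j in range(d)] for k in range(c)]
    w = max(cols, key=lambda col: sum(v * v for v in col))
    for _ in range(iters):
        v = [sum(M[j][k] * w[j] for j in range(d)) for k in range(c)]
        w = [sum(M[j][k] * v[k] for k in range(c)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data and labels have zero cross-covariance")
        w = [x / norm for x in w]
    return w


def project(X, w):
    """Scores of the centered data along direction w."""
    return [sum(a * b for a, b in zip(row, w)) for row in center(X)]


# Toy data (hypothetical): dimension 0 separates two speakers, dimension 1 is noise.
X = [[1.0, 0.3], [1.2, -0.2], [0.9, 0.1],
     [-1.0, 0.2], [-1.1, -0.3], [-0.9, 0.0]]
labels = [0, 0, 0, 1, 1, 1]

w = first_pls_direction(X, labels)
scores = project(X, w)
```

On this toy data the learned direction puts most of its weight on the label-correlated dimension, so the projected scores of the two speakers fall on opposite sides. A full PLS model would extract several such directions with deflation between them, which this sketch omits.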