Research On Feature Learning In Speaker Recognition

Posted on: 2019-06-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L T Li
GTID: 1368330590451478
Subject: Computer Science and Technology
Abstract/Summary:
Speaker recognition (SRE), an important biometric recognition technology, is the process of automatically identifying or verifying a person's identity from his or her voice. After decades of research, SRE performance has improved greatly, and the technology has been deployed in a wide range of applications. However, present SRE approaches are still far from reliable, especially in unconstrained conditions full of unpredictable uncertainties, e.g., free text, multiple channels, environmental noise, and varied speaking styles. An intuitive way to address these uncertainties is to discover features that are sensitive to speaker traits but robust against other sources of variation. This dissertation therefore focuses on deep feature learning in speaker recognition. Its major contributions are as follows:

1. A convolutional time-delay deep neural network for speaker feature learning. Based on the properties of the speech signal, and taking into account both the representation of speaker traits and the trainability of the model, a convolutional time-delay deep neural network (CTDNN), consisting of a convolutional component and a time-delay component, was built to learn deep speaker features. Qualitative and quantitative analyses demonstrated that the learned features are strongly discriminative for speakers.

2. Research on the generalizability of deep speaker features. The training objective of speaker feature learning is to discriminate among different speakers rather than to optimize the speaker recognition task directly. Several schemes were therefore designed from different perspectives to verify the effectiveness of deep speaker features and to demonstrate the generalizability of the feature learning approach.

3. Full-info training for speaker feature learning. Because the training objective of speaker feature learning focuses only on maximizing inter-speaker variation and neglects constraints on within-speaker variation, the deep speaker features exhibit within-speaker divergence. A full-info training approach based on a centroid-converge criterion was therefore proposed. While inter-speaker variation is still maximized, a within-speaker constraint was injected into the training process to improve the cohesiveness of the deep speaker features.

4. Phone-aware training for speaker feature learning. Because speaker feature learning depends entirely on a complex model structure and a large amount of training data, this 'blind' data-driven learning is highly susceptible to non-speaker factors, especially phonetic content. Inspired by the success of conditional learning, a phone-aware training approach based on a phonetic-compensation criterion was therefore proposed. The phonetic information of each frame was supplied to the model during training. With this phonetic compensation, the within-speaker variation caused by phonetic content can be largely explained away, and the quality of the learned features was improved.
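
The idea behind the third contribution, a discriminative objective combined with a within-speaker constraint, can be illustrated with a toy sketch. The abstract does not state the exact form of the centroid-converge criterion, so everything below is an assumption made for illustration: the centroid is taken to be the per-speaker mean of the features in a batch, the constraint is the mean squared distance to that centroid, and the function names and the weight `alpha` are hypothetical rather than taken from the dissertation.

```python
import math
from collections import defaultdict


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one frame's speaker logits against its true speaker index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def speaker_centroids(features, labels):
    """Mean feature vector per speaker (the 'centroid' assumed in this sketch)."""
    sums, counts = {}, defaultdict(int)
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def within_speaker_penalty(features, labels):
    """Average squared distance from each feature to its own speaker's centroid."""
    centroids = speaker_centroids(features, labels)
    total = 0.0
    for f, y in zip(features, labels):
        total += sum((v - c) ** 2 for v, c in zip(f, centroids[y]))
    return total / len(features)


def combined_loss(logits_batch, features, labels, alpha=0.1):
    """Discriminative term plus alpha times the within-speaker term.

    alpha is a hypothetical trade-off weight, not a value from the source.
    """
    ce = sum(softmax_cross_entropy(z, y) for z, y in zip(logits_batch, labels))
    ce /= len(labels)
    return ce + alpha * within_speaker_penalty(features, labels)


# Toy batch: two speakers, two frames each, 2-dimensional features.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
logits_batch = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]

penalty = within_speaker_penalty(features, labels)
loss = combined_loss(logits_batch, features, labels)
```

In this sketch the cross-entropy term pushes features of different speakers apart, while the penalty term pulls each speaker's features toward a common point, which is one plausible reading of "improving cohesiveness". In a real system both terms would be computed on network outputs and minimized by gradient descent.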
Keywords/Search Tags: speaker recognition, feature learning, deep learning