Research On Robust Speaker Features Based On Domain-adversarial Training And Attention

Posted on: 2022-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Li
Full Text: PDF
GTID: 2518306572960049
Subject: Computer technology
Abstract/Summary:
Human speech is highly susceptible to external factors. Environmental noise, differences in recording devices and transmission channels, the language used by the speaker, and the content and style of the speech all change the acoustic representation of speech and thereby degrade speaker recognition performance. It is therefore particularly important to study robust speaker features for real-world scenarios. In addition, there may be a large gap between the distribution of the training corpus and that of the speech data encountered in practice, a problem known as domain mismatch. Research on robust features can be divided into two cases: the same-domain problem, in which the training and test corpora differ little and are affected only by minor factors such as noise; and the cross-domain problem, in which there is an obvious domain mismatch between the training and test corpora. This thesis studies robust speaker features for both the same-domain and cross-domain problems. For the same-domain problem, the model itself is optimized so that the extracted features contain more speaker information, which improves their robustness. For the cross-domain problem, domain-adversarial training is used to post-process the features and obtain domain-independent features that are robust to the domain. The main research contents are as follows:

(1) A speaker representation extraction method based on Long Short-Term Memory (LSTM) and an attention mechanism is proposed. An LSTM models the temporal information between frame-level features, optimizing the transformation from frame-level to segment-level features in the representation extraction model and further extracting the speaker component from long-term information. When frame-level features are fused into segment-level features, attention is used to optimize the frame-level weights: frames that are useful for identifying the speaker are selected, and the influence of speaker-independent frames is suppressed, improving the robustness of the features on same-domain data.

(2) An attention calculation method based on an adaptive query is proposed. Attention is often computed with a global vector obtained by training. This global vector contains too much information about the training data, which affects the extraction of speaker representations on the test set. Replacing the trained global vector with an adaptive query calculated from the utterance itself makes the conversion from frame-level to segment-level features focus more on the speaker information contained in the utterance itself, further achieving robust speaker feature extraction for the same-domain problem. The experimental results show that this method also performs well on the cross-domain problem.

(3) A method for extracting domain-independent speaker features using improved domain-adversarial training is proposed. Optimizing the model cannot completely eliminate the interference of domain information, so feature post-processing is applied to the speaker features that have already been extracted in order to remove the domain information they contain. Domain-adversarial training with a gradient reversal layer sets up the adversarial game between the feature extractor and the domain classifier. After training, the feature extractor is used to extract domain-independent speaker representations, improving the robustness of speaker features to cross-domain problems. During domain-adversarial training, the combined gradient passed from the speaker classifier and the domain classifier to the feature extractor is not guaranteed to satisfy the optimization directions of both classifiers at the same time. A gradient rotation method is therefore used so that the combined gradient serves both optimization tasks and the descent direction used to train the feature extractor remains effective, which facilitates training.
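As a rough illustration of the attention pooling described in contributions (1) and (2), the sketch below fuses frame-level vectors into one segment-level vector using softmax weights, once with a fixed global query and once with a query derived from the utterance itself. The abstract does not say how the adaptive query is computed, so the mean-of-frames query here is purely an assumption, and the function names `attentive_pool` and `adaptive_query` are hypothetical. The LSTM that would produce the frame-level features is omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attentive_pool(frames, query):
    """Fuse frame-level vectors into one segment-level vector.

    Each frame is weighted by softmax(frame . query), so frames that
    align with the query contribute more to the pooled vector.
    """
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

def adaptive_query(frames):
    # ASSUMPTION: the query is the mean of the utterance's own frames.
    # The thesis may compute its adaptive query differently.
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

# Toy utterance: two similar frames and one outlier frame.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# A fixed vector standing in for a query learned on training data.
global_query = [0.0, 1.0]

pooled_global, w_global = attentive_pool(frames, global_query)
pooled_adaptive, w_adaptive = attentive_pool(frames, adaptive_query(frames))
```

In this toy case the fixed query favours the outlier frame, while the utterance-derived query favours the two frames that agree with each other, which is the kind of behaviour the adaptive query is meant to encourage.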
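The gradient conflict mentioned in contribution (3) can be illustrated with a small numeric sketch. A gradient reversal layer passes the domain classifier's gradient to the feature extractor with its sign flipped (scaled by a factor, here called `lam`), and the feature extractor then sums it with the speaker classifier's gradient. The abstract does not define the gradient rotation method, so the code below substitutes a simple projection rule as a stand-in: when the two gradients conflict, each has its component along the other removed before summing. The helper names and the example gradients are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reverse_gradient(g_dom, lam=1.0):
    # What a gradient reversal layer hands to the feature extractor:
    # the domain classifier's gradient, negated and scaled by lam.
    return [-lam * x for x in g_dom]

def project_away(g, other):
    # STAND-IN for "gradient rotation" (an assumption, not the thesis's rule):
    # if g conflicts with other (negative dot product), remove the
    # component of g that points against other.
    d = dot(g, other)
    if d >= 0:
        return list(g)
    k = d / dot(other, other)
    return [x - k * y for x, y in zip(g, other)]

def combine(g_spk, g_dom, lam=1.0):
    """Return (combined gradient, reversed domain gradient)."""
    g_rev = reverse_gradient(g_dom, lam)
    a = project_away(g_spk, g_rev)
    b = project_away(g_rev, g_spk)
    return [x + y for x, y in zip(a, b)], g_rev

# Toy gradients with respect to two feature-extractor parameters.
g_spk = [1.0, 0.0]   # from the speaker classifier
g_dom = [3.0, -1.0]  # from the domain classifier, before reversal

g_rev = reverse_gradient(g_dom)
naive = [x + y for x, y in zip(g_spk, g_rev)]
combined, _ = combine(g_spk, g_dom)

# A non-negative dot product with a task gradient means a descent step
# along the combined gradient does not work against that task.
naive_vs_speaker = dot(naive, g_spk)
combined_vs_speaker = dot(combined, g_spk)
combined_vs_reversed = dot(combined, g_rev)
```

Here the naive sum has a negative dot product with the speaker gradient, so a step along it would raise the speaker loss, whereas the adjusted sum has non-negative dot products with both the speaker gradient and the reversed domain gradient.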
Keywords/Search Tags:Speaker recognition, Robustness, Domain-adversarial training, Attention