
Monaural Multi-speaker Speech Separation And Recognition

Posted on: 2020-10-21  Degree: Master  Type: Thesis
Country: China  Candidate: X K Chang  Full Text: PDF
GTID: 2428330620459985  Subject: Computer Science and Technology
Abstract/Summary:
The cocktail party problem, i.e., tracing and distinguishing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the most critical problems in speech processing. Despite the progress that has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. Following the recent progress achieved by deep learning, researchers have proposed many deep-learning-based methods for multi-speaker speech separation and recognition. In this work, we explore the use of permutation invariant training (PIT) for monaural multi-speaker speech separation and recognition, and we propose three main approaches. First, we take the ASR criterion as the final goal. We train the monaural multi-speaker speech separation/recognition model using speech feature separation and speech recognition as criteria, and we apply joint learning to combine feature separation and recognition. We further introduce a gated convolutional network and an attention mechanism to improve speech recognition performance. Second, to address the mismatch between the training and evaluation data, we propose a speaker adaptive training technique that uses auxiliary features in the monaural multi-speaker speech recognition task. We also apply multi-task learning, with prediction of the auxiliary feature as a second task. Third, we apply end-to-end models, which have recently become popular in automatic speech recognition, to the multi-speaker task and present a state-of-the-art monaural multi-speaker end-to-end ASR model. All the methods proposed in this work were evaluated on two artificially synthesized corpora, AMI-mix and WSJ-mix. The results show that the PIT-based monaural multi-speaker speech recognition model achieves a significant reduction in word error rate (WER) compared with conventional speech recognition systems.
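The core idea of PIT mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the mean-squared-error pair loss, the toy feature vectors, and the names `mse` and `pit_loss` are all assumptions made for illustration. The sketch tries every assignment of model outputs to reference speakers and keeps the assignment with the lowest loss, so the training objective does not depend on the order in which the speakers are listed.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length feature sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(outputs, targets, pair_loss=mse):
    """Return (lowest mean loss, best permutation).

    best_perm[i] is the index of the target assigned to outputs[i].
    pair_loss is a hypothetical stand-in; a real system might use a
    separation or recognition criterion instead of MSE.
    """
    n = len(outputs)
    best_loss, best_perm = float("inf"), None
    # Exhaustive search over all n! output-to-target assignments.
    for perm in permutations(range(n)):
        loss = sum(pair_loss(outputs[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example (made-up numbers): the model emits the two speakers in swapped order.
targets = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
outputs = [[0.1, 0.0, 0.1], [0.9, 1.0, 1.0]]
loss, perm = pit_loss(outputs, targets)
print(perm)  # assignment chosen by PIT
```

In this toy case the swapped assignment is selected, because each output is much closer to the other speaker's reference. The exhaustive search grows factorially with the number of speakers, which is acceptable for the two- or three-speaker mixtures typical of this kind of task.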
Keywords/Search Tags: Neural Network, Permutation Invariant Training, Cocktail Party Problem, Speaker Adaptive Training, Speech Recognition