
Speech Enhancement And Separation Based On Deep Neural Networks

Posted on: 2021-11-27    Degree: Master    Type: Thesis
Country: China    Candidate: J Wu    Full Text: PDF
GTID: 2568307184960389    Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the application of deep-learning-based acoustic and language modeling techniques has greatly improved the accuracy of speech recognition. In real environments, however, voice interaction suffers from challenging acoustic conditions such as signal attenuation, noise, reverberation and speaker overlap, which seriously degrade the performance of the recognition engine. Speech enhancement and separation are techniques proposed to counter the interference caused by noise and overlapping speech, aiming to improve speech quality and interaction performance in terms of human listening or recognition accuracy. In this thesis, we focus on the enhancement and separation tasks and study mask-based adaptive beamforming, single-channel time-domain audio-visual speech separation, and an end-to-end multi-channel online speech separation system.

1. Mask-based adaptive beamforming. Adaptive beamforming is a typical multi-channel speech enhancement algorithm that improves speech quality in far-field environments by suppressing energy from non-target directions. A traditional adaptive beamformer, e.g., the MVDR beamformer, depends on a source localization module to provide the speaker direction and cannot obtain accurate estimates of the covariance matrices, which limits its improvements on both WER and SNR metrics. Mask-based beamforming is a geometry-independent algorithm that estimates the steering vector and covariance matrices by predicting time-frequency masks (TF-masks) for the noise and the source speech. In this thesis, we evaluate the effectiveness of the algorithm on the CHiME-4 dataset and, for the CHiME-5 speech separation challenge, propose to use i-vectors as instructive features to estimate the target speaker's masks through speaker-aware training. We obtain a 10% absolute WER improvement on the development set with the official acoustic model. With our own optimized acoustic model, we further obtain a 20% absolute WER improvement without system fusion, a result second only to iFlyTek's multi-model system.
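The abstract gives no formulas; the NumPy sketch below illustrates the mask-based MVDR idea in part 1, estimating per-frequency spatial covariance matrices from predicted TF-masks and deriving beamformer weights. The reference-channel MVDR formulation and all variable names are assumptions for illustration, not the thesis's exact implementation.

```python
import numpy as np

def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance per frequency bin.
    Y:    (C, F, T) complex STFT of the C-channel array signal
    mask: (F, T) TF-mask in [0, 1] (speech mask or noise mask)
    returns: (F, C, C) covariance matrices
    """
    R = np.einsum("ft,cft,dft->fcd", mask, Y, Y.conj())
    norm = np.maximum(mask.sum(axis=-1), 1e-8)
    return R / norm[:, None, None]

def mvdr_weights(R_speech, R_noise, ref=0):
    """MVDR weights per frequency, reference-channel form:
    w(f) = (R_n^{-1} R_s / tr(R_n^{-1} R_s)) u_ref
    """
    F, C, _ = R_speech.shape
    u = np.zeros(C)
    u[ref] = 1.0
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        num = np.linalg.solve(R_noise[f], R_speech[f])  # R_n^{-1} R_s
        w[f] = (num / np.trace(num)) @ u
    return w

# Enhanced spectrum: X_hat[f, t] = w[f]^H Y[:, f, t]
# X_hat = np.einsum("fc,cft->ft", w.conj(), Y)
```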
2. Single-channel time-domain audio-visual speech separation. In speech enhancement and separation tasks, modeling speech directly in the time domain can achieve better performance than frequency-domain approaches, since it avoids the phase problem when reconstructing the signal. Compared with speaker-related features such as speaker embeddings and directional features, visual cues are a stronger prior because they are noise-independent and informative, so audio-visual schemes that use visual cues as a prior are a common approach to target-speaker separation. Traditional audio-visual models use frequency-domain audio representations together with face- or word-level visual embeddings. Frequency-domain audio modeling requires extra steps to address phase enhancement, and face- or word-level visual extractors produce coarse features that are only loosely related to the separation task. In this work, we propose to train the visual feature extractor with phoneme-level targets and to adopt a time-domain audio representation, which respectively better model the context of the target speaker and avoid the phase enhancement problem. We simulate two- and three-speaker test sets from the LRS2 dataset and obtain 3-4 dB absolute improvements in Si-SNR (a standard computation of this metric is sketched below) over the state-of-the-art Conv-TasNet and the corresponding frequency-domain audio-visual method.

3. End-to-end multi-channel online speech separation. A separation system designed for the meeting transcription task must satisfy both latency and recognition accuracy requirements. Neural-network-based adaptive beamforming requires a long speech context to estimate the signal covariance matrices, so it cannot bring stable recognition improvements in low-latency scenarios. Microsoft previously proposed the Unmixing, Localization, Extraction (ULE) system, which instead uses fixed beamformers to avoid introducing additional system latency. Because fixed beamformers have limited noise cancellation capability and depend on the source directions, ULE uses an extraction network to filter out the interfering speaker's energy in the selected beam, and a source localization module together with the unmixing network to predict the speaker directions. Building on the ULE system, we propose an end-to-end multi-channel network. It contains an attention-based beam selection module and enables joint optimization of the speech unmixing, source localization and extraction networks during training. On a two-speaker test set mixed from single-speaker real recordings, the proposed model performs comparably to neural-network-based MVDR and the ULE system in offline mode, and yields 12.47% and 22.40% relative WER improvements over ULE in online mode.
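The attention-based beam selection in part 3 is only described at a high level. One plausible reading, sketched below in NumPy, scores per-beam features of the fixed beamformer outputs against a query derived from the estimated source direction and combines the beams with softmax weights; the module shapes, projections and utterance-level pooling are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_beam_selection(beam_feats, query, W_k, W_q):
    """Soft beam selection over fixed beamformer outputs.
    beam_feats: (B, T, D) features of the B fixed-beam outputs
    query:      (Dq,)     e.g., an embedding of the estimated source direction
    W_k: (D, H), W_q: (Dq, H) learned projections (random placeholders here)
    returns: (T, D) attended features and (B,) per-beam weights
    """
    keys = beam_feats.mean(axis=1) @ W_k           # (B, H) utterance-level keys
    q = query @ W_q                                # (H,)
    scores = keys @ q / np.sqrt(keys.shape[-1])    # (B,) scaled dot-product scores
    alpha = softmax(scores)                        # soft selection weights
    fused = np.einsum("b,btd->td", alpha, beam_feats)
    return fused, alpha

# Usage with random placeholders
rng = np.random.default_rng(0)
B, T, D, Dq, H = 8, 100, 64, 16, 32
feats = rng.standard_normal((B, T, D))
direction_emb = rng.standard_normal(Dq)
W_k, W_q = rng.standard_normal((D, H)), rng.standard_normal((Dq, H))
fused, alpha = attention_beam_selection(feats, direction_emb, W_k, W_q)
```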
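For reference, Si-SNR, the metric reported for the audio-visual experiments in part 2, is the scale-invariant signal-to-noise ratio; a standard computation (not taken from the thesis) is:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps) + eps)
```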
Keywords/Search Tags:Speech Enhancement, Speech Separation, Neural Networks, Beamforming, Audiovisual, Multi-channel