Research On Multi-modal Speech Separation Based On Audio-visual Combination

Posted on: 2022-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: C H Li
Full Text: PDF
GTID: 2518306788995359
Subject: Automation Technology
Abstract/Summary:
In a typical video call, the frontal face of the target speaker appears in the video and the target speaker's speech appears in the audio, but the audio may also contain speech from interfering speakers as well as non-speech background noise. In such conditions it is often difficult to hear the target speaker clearly, which makes listening uncomfortable and results in a poor listening experience. To improve the user's listening experience, we need to train a machine that can selectively isolate the target speaker's speech in a given scene. In real life, the human ear can listen selectively and perceive the direction and distance of sounds, so humans can attend to the speaker they want to hear according to their own needs. If a machine could be given these abilities of the human ear, the quality of the separated speech could be greatly improved. However, after decades of research, existing models still cannot achieve a satisfactory separation effect, and overcoming this problem requires continued effort and innovation.

Based on the current state of speech separation research, this thesis analyzes the shortcomings of existing models and proposes several new ideas. Combining a time-domain speech separation method with an audio-visual framework achieves performance beyond either single model. In addition, the facial features of the target speaker are used as anchors to identify and separate the target speech, and a facial deformation is applied to enlarge the mouth region of the video frames before they are fed to the model as visual features. Experimental results show that the proposed model can effectively accomplish the speech separation task.

To sum up, our main contributions are as follows:

1. Comparing the proposed model with the Google audio-visual model and a time-domain model on the AVSpeech dataset demonstrates that the combined time-domain audio-visual speech separation model is more effective than either single model.

2. Comparing raw face input with distorted face input shows that the proposed face distortion operation (sketched below) improves the SI-SNR score by 2.3 dB over raw image input.

3. We propose a new speech separation model with two branches: a time-domain model processes the speech, dilated convolutions process the images, the two streams are fused, and a DPRNN module serves as the separation module (see the sketch below). This model can effectively accomplish the speech separation task.
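The separation quality in contribution 2 is reported in SI-SNR, the scale-invariant signal-to-noise ratio commonly used as both training objective and evaluation metric for time-domain separation models. As a reference for how such a score is computed, here is a minimal NumPy sketch of the metric; the function name and the epsilon guard are our own choices for illustration, not taken from the thesis.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB."""
    # Zero-mean both signals so the measure ignores constant offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the component "explained" by the target.
    # Normalizing by the target energy makes the score invariant to rescaling.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

The 2.3 dB figure in contribution 2 is the difference between this score with and without the face distortion applied to the visual input.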
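The abstract does not spell out the exact deformation, but a local magnification warp around the detected mouth center is one plausible reading of "facial deformation to enlarge the mouth region". The sketch below assumes the mouth center (cx, cy) comes from an external face-landmark detector; the radius and strength parameters are illustrative defaults, not values from the thesis.

```python
import cv2
import numpy as np

def enlarge_mouth(frame: np.ndarray, cx: int, cy: int,
                  radius: int = 60, strength: float = 0.5) -> np.ndarray:
    """Locally magnify a circular region around (cx, cy), e.g. the mouth center.

    Destination pixels inside `radius` sample the source closer to the center,
    so the mouth occupies more of the frame; outside the circle nothing changes.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - cx, ys - cy
    r = np.sqrt(dx * dx + dy * dy)
    # Sampling scale: strongest magnification at the center, fading to 1 at the edge.
    scale = np.where(r < radius, 1.0 - strength * (1.0 - r / radius), 1.0)
    map_x = (cx + dx * scale).astype(np.float32)
    map_y = (cy + dy * scale).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

The warped frames would then replace the raw face crops as input to the visual branch.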
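Finally, a hedged PyTorch sketch of the two-branch architecture described in contribution 3: a time-domain encoder for the audio, dilated 1-D convolutions over per-frame face embeddings, feature fusion, and a single dual-path (DPRNN) block that estimates a mask for the target speaker. All layer sizes, the chunking scheme, and the fusion-by-concatenation choice are assumptions for illustration; the actual model in the thesis may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPRNNBlock(nn.Module):
    # One dual-path block: a bi-LSTM over each chunk (intra-chunk), then one
    # across chunks (inter-chunk), each followed by a residual connection.
    def __init__(self, feat: int, hidden: int):
        super().__init__()
        self.intra = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.intra_fc = nn.Linear(2 * hidden, feat)
        self.inter = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter_fc = nn.Linear(2 * hidden, feat)

    def forward(self, x):                                  # x: (B, feat, K, S)
        b, n, k, s = x.shape
        y = x.permute(0, 3, 2, 1).reshape(b * s, k, n)     # each chunk as a sequence
        x = x + self.intra_fc(self.intra(y)[0]).view(b, s, k, n).permute(0, 3, 2, 1)
        y = x.permute(0, 2, 3, 1).reshape(b * k, s, n)     # sequence across chunks
        x = x + self.inter_fc(self.inter(y)[0]).view(b, k, s, n).permute(0, 3, 1, 2)
        return x

class AudioVisualSeparator(nn.Module):
    # Sketch of the two-branch design: time-domain audio encoder, dilated-conv
    # visual branch, concatenation-based fusion, DPRNN mask estimator, decoder.
    def __init__(self, feat: int = 64, vis_dim: int = 512, chunk: int = 100):
        super().__init__()
        self.chunk = chunk
        self.enc = nn.Conv1d(1, feat, kernel_size=16, stride=8)    # waveform -> frames
        self.vis = nn.Sequential(                                  # dilated convs over face embeddings
            nn.Conv1d(vis_dim, feat, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(feat, feat, 3, padding=2, dilation=2), nn.ReLU(),
        )
        self.fuse = nn.Conv1d(2 * feat, feat, 1)
        self.dprnn = DPRNNBlock(feat, hidden=128)
        self.mask = nn.Conv1d(feat, feat, 1)
        self.dec = nn.ConvTranspose1d(feat, 1, kernel_size=16, stride=8)

    def forward(self, wav, face):               # wav: (B, 1, T); face: (B, vis_dim, T_v)
        a = F.relu(self.enc(wav))               # (B, feat, L)
        v = self.vis(face)                      # (B, feat, T_v)
        v = F.interpolate(v, size=a.shape[-1])  # upsample video rate to audio frame rate
        h = self.fuse(torch.cat([a, v], dim=1))
        # Split the fused frames into fixed-size chunks for the dual-path RNN.
        pad = (-h.shape[-1]) % self.chunk
        h = F.pad(h, (0, pad)).unflatten(-1, (-1, self.chunk)).transpose(2, 3)
        h = self.dprnn(h)                       # (B, feat, K, S)
        h = h.transpose(2, 3).flatten(2)[..., : a.shape[-1]]
        m = torch.sigmoid(self.mask(h))         # mask for the target speaker
        return self.dec(a * m)                  # masked frames back to a waveform
```

As a shape check, `AudioVisualSeparator()(torch.randn(2, 1, 16000), torch.randn(2, 512, 25))` maps one second of 16 kHz audio and 25 face-embedding frames to an estimated target waveform of the same length.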
Keywords/Search Tags: speech separation, audio-visual, speaker extraction, face distortion, multi-modal