Research On Multi-modal Speech Separation Based On Audio-visual Combination

Posted on: 2022-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: C H Li
Full Text: PDF
GTID: 2518306788995359
Subject: Automation Technology
Abstract/Summary:
In a typical video call, the frontal face of the target speaker appears in the video and the target speaker's speech appears in the audio, but the audio may also contain speech from interfering speakers as well as non-speech background noise. In such conditions it is often difficult to hear the target speaker clearly, which makes listening uncomfortable and results in a poor listening experience. To improve the user's listening experience, we need to train a machine that can selectively isolate the target speaker's speech in a given scene. In real life, the human ear can listen selectively and perceive the direction and distance of sounds, so humans can attend to the speaker they want to hear according to their own needs. If a machine could be given these abilities of the human ear, the quality of the separated speech could be greatly improved. However, after decades of research, existing models still cannot achieve a satisfactory separation effect, and overcoming this problem requires continued effort and innovation.

Based on the current state of speech separation research, this thesis analyzes the shortcomings of existing models and proposes several new ideas. Combining a time-domain speech separation method with an audio-visual framework achieves performance beyond either single model. In addition, the facial features of the target speaker are used as anchors to identify and separate the target speech, and a facial deformation is applied to enlarge the mouth region of the video frames before they are fed to the model as visual features. Experimental results show that the proposed model can effectively accomplish the speech separation task.

To sum up, our main contributions are as follows:

1. Comparing the proposed model with the Google audio-visual model and a time-domain model on the AVSpeech dataset demonstrates that the combined time-domain audio-visual speech separation model is more effective than either single model.

2. Comparing raw face input with distorted face input shows that the proposed face distortion operation (sketched below) improves the SI-SNR score by 2.3 dB over raw image input.

3. We propose a new speech separation model with two branches: a time-domain model processes the speech, dilated convolutions process the images, the two streams are fused, and a DPRNN module serves as the separation module (see the sketch below). This model can effectively accomplish the speech separation task.
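The separation quality in contribution 2 is reported in SI-SNR, the scale-invariant signal-to-noise ratio commonly used as both training objective and evaluation metric for time-domain separation models. As a reference for how such a score is computed, here is a minimal NumPy sketch of the metric; the function name and the epsilon guard are our own choices for illustration, not taken from the thesis.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB."""
    # Zero-mean both signals so the measure ignores constant offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the component "explained" by the target.
    # Normalizing by the target energy makes the score invariant to rescaling.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

The 2.3 dB figure in contribution 2 is the difference between this score with and without the face distortion applied to the visual input.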
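The abstract does not spell out the exact deformation, but a local magnification warp around the detected mouth center is one plausible reading of "facial deformation to enlarge the mouth region". The sketch below assumes the mouth center (cx, cy) comes from an external face-landmark detector; the radius and strength parameters are illustrative defaults, not values from the thesis.

```python
import cv2
import numpy as np

def enlarge_mouth(frame: np.ndarray, cx: int, cy: int,
                  radius: int = 60, strength: float = 0.5) -> np.ndarray:
    """Locally magnify a circular region around (cx, cy), e.g. the mouth center.

    Destination pixels inside `radius` sample the source closer to the center,
    so the mouth occupies more of the frame; outside the circle nothing changes.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - cx, ys - cy
    r = np.sqrt(dx * dx + dy * dy)
    # Sampling scale: strongest magnification at the center, fading to 1 at the edge.
    scale = np.where(r < radius, 1.0 - strength * (1.0 - r / radius), 1.0)
    map_x = (cx + dx * scale).astype(np.float32)
    map_y = (cy + dy * scale).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

The warped frames would then replace the raw face crops as input to the visual branch.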
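Finally, a hedged PyTorch sketch of the two-branch architecture described in contribution 3: a time-domain encoder for the audio, dilated 1-D convolutions over per-frame face embeddings, feature fusion, and a single dual-path (DPRNN) block that estimates a mask for the target speaker. All layer sizes, the chunking scheme, and the fusion-by-concatenation choice are assumptions for illustration; the actual model in the thesis may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPRNNBlock(nn.Module):
    # One dual-path block: a bi-LSTM over each chunk (intra-chunk), then one
    # across chunks (inter-chunk), each followed by a residual connection.
    def __init__(self, feat: int, hidden: int):
        super().__init__()
        self.intra = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.intra_fc = nn.Linear(2 * hidden, feat)
        self.inter = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter_fc = nn.Linear(2 * hidden, feat)

    def forward(self, x):                                  # x: (B, feat, K, S)
        b, n, k, s = x.shape
        y = x.permute(0, 3, 2, 1).reshape(b * s, k, n)     # each chunk as a sequence
        x = x + self.intra_fc(self.intra(y)[0]).view(b, s, k, n).permute(0, 3, 2, 1)
        y = x.permute(0, 2, 3, 1).reshape(b * k, s, n)     # sequence across chunks
        x = x + self.inter_fc(self.inter(y)[0]).view(b, k, s, n).permute(0, 3, 1, 2)
        return x

class AudioVisualSeparator(nn.Module):
    # Sketch of the two-branch design: time-domain audio encoder, dilated-conv
    # visual branch, concatenation-based fusion, DPRNN mask estimator, decoder.
    def __init__(self, feat: int = 64, vis_dim: int = 512, chunk: int = 100):
        super().__init__()
        self.chunk = chunk
        self.enc = nn.Conv1d(1, feat, kernel_size=16, stride=8)    # waveform -> frames
        self.vis = nn.Sequential(                                  # dilated convs over face embeddings
            nn.Conv1d(vis_dim, feat, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(feat, feat, 3, padding=2, dilation=2), nn.ReLU(),
        )
        self.fuse = nn.Conv1d(2 * feat, feat, 1)
        self.dprnn = DPRNNBlock(feat, hidden=128)
        self.mask = nn.Conv1d(feat, feat, 1)
        self.dec = nn.ConvTranspose1d(feat, 1, kernel_size=16, stride=8)

    def forward(self, wav, face):               # wav: (B, 1, T); face: (B, vis_dim, T_v)
        a = F.relu(self.enc(wav))               # (B, feat, L)
        v = self.vis(face)                      # (B, feat, T_v)
        v = F.interpolate(v, size=a.shape[-1])  # upsample video rate to audio frame rate
        h = self.fuse(torch.cat([a, v], dim=1))
        # Split the fused frames into fixed-size chunks for the dual-path RNN.
        pad = (-h.shape[-1]) % self.chunk
        h = F.pad(h, (0, pad)).unflatten(-1, (-1, self.chunk)).transpose(2, 3)
        h = self.dprnn(h)                       # (B, feat, K, S)
        h = h.transpose(2, 3).flatten(2)[..., : a.shape[-1]]
        m = torch.sigmoid(self.mask(h))         # mask for the target speaker
        return self.dec(a * m)                  # masked frames back to a waveform
```

As a shape check, `AudioVisualSeparator()(torch.randn(2, 1, 16000), torch.randn(2, 512, 25))` maps one second of 16 kHz audio and 25 face-embedding frames to an estimated target waveform of the same length.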
Keywords/Search Tags: speech separation, audio-visual, speaker extraction, face distortion, multi-modal