
Research On Speaker Speech Separation In The Scene Of Wearing A Mask

Posted on: 2023-04-26    Degree: Master    Type: Thesis
Country: China    Candidate: F M Meng    Full Text: PDF
GTID: 2568307061454064    Subject: Computer technology
Abstract/Summary:
At the beginning of 2020, as Corona Virus Disease 2019 (COVID-19) swept the world, people began wearing masks during daily communication. Wearing a mask not only reduces sound intensity but also blocks the visual information of the speaker's lip movements, the visual cue most relevant to speech, which poses greater challenges for the single-channel speech separation task. To cope with these challenges, this thesis studies single-channel multi-speaker speech separation in the scenario of speakers wearing masks. Specifically, it realizes the process of separating each source from mixed audio containing multiple simultaneous masked speakers in order to obtain a specific voice. The main work and contributions are as follows:

(1) In terms of dataset construction, this thesis builds a new speech separation dataset, SSWM (Speech Separation while Wearing a Mask). It is the first dataset focused on speech separation research in the masked-speaker scenario. It includes two modalities: the visual modality of the masked speaker's face and the speech modality of the masked speaker. It covers 180 speakers with a cumulative duration of up to 40 hours.

(2) In terms of speech separation methods for the masked-speaker scenario, this thesis proposes a new audio-visual fusion speech separation model, SSWMNet (SSWM Network), to better adapt to speech separation when speakers wear masks. SSWMNet improves on the classic U-Net-based speech separation model in two ways: an attention mechanism is integrated into the feature extraction module of the U-Net speech separation network to fully extract both shallow and deep features, and the Dynamic ReLU activation function replaces the standard ReLU. Experiments show that the audio-visual speech separation network incorporating the attention mechanism improves on the classic U-Net speech separation model by 9.62% in Normalized SDR (N-SDR), an objective evaluation metric for speech separation. Compared with the attention-augmented U-Net model, the model using Dynamic ReLU achieves a further 7.46% improvement in N-SDR. These experimental results demonstrate the effectiveness of SSWMNet. In addition, this thesis verifies the generality of SSWMNet. First, SSWMNet is tested on the mask-free multimodal speech separation datasets GRID and TCD-TIMIT, where it also achieves good separation results, so SSWMNet is likewise applicable to speech separation tasks without masks. Second, audio-visual speech datasets containing both masked and unmasked speakers are constructed, and experiments show that SSWMNet again outperforms the U-Net-based speech separation methods.

(3) To deal with the occlusion of lip visual information when a speaker wears a mask, this thesis proposes a speech separation technique based on lip-shape restoration networks, using the restored lip visual information to assist speech separation. First, three different lip-shape restoration networks (Speech2Vid, LipGAN, and Wav2Lip) are employed. Given a single unmasked face image and the speech of a masked speaker, each of these networks can restore the speaker's lip visual information while guaranteeing lip synchronization. Second, the restored lip visual information and the masked speaker's speech are fed into SSWMNet to complete the speech separation task. Experimental results show that incorporating the lip visual information generated by the lip-shape restoration networks significantly improves SSWMNet's speech separation performance compared with the network without this information, and Wav2Lip performs best among the three. Finally, to verify the robustness of the network, four voice-mixing settings are tested: male-male, female-female, male-female, and random mixing. The results show that in assisting the speech separation of masked speakers, the lip visual information generated by Wav2Lip is robust to both same-gender and cross-gender mixtures.
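As a reference for the N-SDR metric reported above, the following is a minimal sketch in Python. It assumes the common definition of Normalized SDR, namely the SDR of the separated estimate minus the SDR of the unprocessed mixture, both measured against the clean source; the thesis's exact formulation may differ in detail, and the signal names here are illustrative.

```python
import numpy as np

def sdr(estimate: np.ndarray, source: np.ndarray) -> float:
    """Signal-to-distortion ratio (dB) of an estimate against the clean source."""
    noise = estimate - source
    return 10.0 * np.log10(np.sum(source ** 2) / np.sum(noise ** 2))

def nsdr(estimate: np.ndarray, mixture: np.ndarray, source: np.ndarray) -> float:
    """Normalized SDR: improvement of the estimate over the raw mixture (dB)."""
    return sdr(estimate, source) - sdr(mixture, source)

# Synthetic check: an estimate closer to the source than the mixture
# should yield a positive N-SDR.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)        # 1 s of "target" speech at 16 kHz
interference = rng.standard_normal(16000)  # competing speaker
mixture = source + interference
estimate = source + 0.1 * interference     # partially separated output
improvement = nsdr(estimate, mixture, source)
```

Because the estimate retains the interference at one-tenth amplitude (one-hundredth power), its residual noise power is 100 times smaller than the mixture's, so the N-SDR improvement is exactly 10·log10(100) = 20 dB in this toy setup.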
Keywords/Search Tags: deep learning, speech separation while wearing a mask, multi-modal feature fusion, attention mechanism, spectrogram