
Audio-visual Multimodal Fusion Speech Separation Based On DCNN-BiLSTM And Improved U-Net Network

Posted on: 2022-11-20    Degree: Master    Type: Thesis
Country: China    Candidate: S B Wang    Full Text: PDF
GTID: 2518306614459974    Subject: Automation Technology
Abstract/Summary:
The speech signal is one of the main vehicles for transmitting human civilization. In real life, speech is often corrupted by interfering speakers or background noise, so within speech signal processing the task of speech separation aims to recover individual signals from a multi-source mixture. It is commonly used as a front-end stage and plays an important role in speech enhancement, speech recognition, and human-computer interaction. Most traditional speech separation techniques focus on single-modality processing of the audio signal alone. With the development of multimedia technology, speech signals are frequently accompanied by video, so exploiting visual signals to assist speech separation has become a new research direction. Moreover, complex acoustic environments barely affect the acquisition of video, and a speaker's face and lip movements are strongly correlated with the voice, so in recent years audio-visual multimodal fusion for speech separation has become a growing trend. Building on these observations and on an extensive survey of the literature, this thesis proposes two multimodal-fusion speech separation methods. The main research contents are as follows:

First, to address the poor separation quality and low perceptual quality of existing methods, this thesis proposes an audio-visual multimodal speech separation technique based on a DCNN-BiLSTM model. Dilated convolutional neural networks (DCNNs) extract features from both the visual and the audio streams, and the fused features are used to assist separation. The proposed method captures more signal features at low complexity and therefore separates speech more effectively. Ablation experiments were conducted on the AVSpeech dataset, the method was compared with the AV model, and separation performance was analyzed with objective metrics. The results show that, compared with existing methods, the proposed method improves the Perceptual Evaluation of Speech Quality (PESQ) by 0.95 on average, the Short-Time Objective Intelligibility (STOI) by 0.20 on average, and the Signal-to-Distortion Ratio (SDR) by 3.73 dB on average.
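The abstract does not give the network configuration, so the PyTorch sketch below only illustrates the general idea of the DCNN-BiLSTM fusion described above: dilated convolution stacks encode the audio spectrogram and the per-frame visual features, the two streams are concatenated, a BiLSTM models temporal context, and a sigmoid head predicts a time-frequency mask. All layer counts, channel sizes, and tensor shapes are assumptions, and the visual features are assumed to be resampled to the audio frame rate.

    # Illustrative sketch only: layer counts, channel sizes, and tensor shapes
    # are assumptions, not the configuration used in the thesis.
    import torch
    import torch.nn as nn

    class DilatedConvStack(nn.Module):
        """Stack of 1-D dilated convolutions over the time axis (the 'DCNN' part)."""
        def __init__(self, in_ch, hid_ch, dilations=(1, 2, 4, 8)):
            super().__init__()
            layers, ch = [], in_ch
            for d in dilations:
                layers += [nn.Conv1d(ch, hid_ch, kernel_size=3, dilation=d, padding=d),
                           nn.BatchNorm1d(hid_ch), nn.ReLU()]
                ch = hid_ch
            self.net = nn.Sequential(*layers)

        def forward(self, x):            # x: (batch, channels, time)
            return self.net(x)

    class AVDCNNBiLSTM(nn.Module):
        """Encode the audio and visual streams, fuse them, model temporal context
        with a BiLSTM, and predict a time-frequency mask for the target speaker."""
        def __init__(self, n_freq=257, vis_dim=512, hid=256):
            super().__init__()
            self.audio_enc = DilatedConvStack(n_freq, hid)
            self.video_enc = DilatedConvStack(vis_dim, hid)
            self.bilstm = nn.LSTM(2 * hid, hid, num_layers=2,
                                  batch_first=True, bidirectional=True)
            self.mask_head = nn.Sequential(nn.Linear(2 * hid, n_freq), nn.Sigmoid())

        def forward(self, spec_mag, vis_feat):
            # spec_mag: (B, F, T) magnitude spectrogram
            # vis_feat: (B, D, T) visual features, assumed resampled to the audio frame rate
            a = self.audio_enc(spec_mag)                       # (B, H, T)
            v = self.video_enc(vis_feat)                       # (B, H, T)
            fused = torch.cat([a, v], dim=1).transpose(1, 2)   # (B, T, 2H)
            seq, _ = self.bilstm(fused)                        # (B, T, 2H)
            mask = self.mask_head(seq).transpose(1, 2)         # (B, F, T)
            return mask * spec_mag                             # masked (separated) magnitude

In practice the masked magnitude would be combined with the mixture phase and inverted with an ISTFT to obtain the separated waveform.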
Second, compared with full-face information, lip information is both much smaller in volume and more strongly correlated with speech, so this thesis proposes a U-Net model that fuses lip information multiple times for speech separation. In this method, the face is detected with the dlib library, the lip region is cropped, lip and audio information are fused several times, and the fused features are fed into a U-Net network for spectral separation, improving separation performance. The method was compared with other models and evaluated with PESQ, STOI, and SDR. Compared with the DCNN-BiLSTM model, the PESQ, STOI, and SDR of separated speaker 1 improved by 0.05, 0.03, and 0.11 dB, respectively, and those of separated speaker 2 improved by 0.03, 0.01, and 0.10 dB, respectively.

Finally, the DCNN-BiLSTM-based separation model extracts more features, which contributes positively to separation quality, while the U-Net network that fuses lip information multiple times uses the lip and audio features to capture finer detail. Its input is the grayscale lip image, which not only retains the key information but also reduces the system's training time, giving the method stronger applicability.
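To make the lip preprocessing of the second method concrete, the sketch below shows one plausible way to obtain the grayscale lip image with the dlib library, using its frontal face detector and the publicly available 68-point landmark model (the mouth corresponds to landmark points 48-67). The landmark-model path, crop margin, and output size are illustrative assumptions, not the thesis's exact settings.

    # Illustrative sketch of dlib-based lip cropping; path, margin, and output
    # size are assumptions rather than the thesis's actual configuration.
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # 68-point landmark model, downloaded separately from dlib.net
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def crop_gray_lips(frame_bgr, size=64, margin=10):
        """Detect a face, locate the mouth landmarks (points 48-67), and return a
        size x size grayscale lip patch, or None if no face is found."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            return None
        shape = predictor(gray, faces[0])
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
        x0, y0 = np.maximum(pts.min(axis=0) - margin, 0)
        x1, y1 = pts.max(axis=0) + margin
        lip = gray[y0:y1, x0:x1]
        return cv2.resize(lip, (size, size))

A sequence of such patches, one per video frame, would then be fused with the audio features before entering the U-Net.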
Keywords/Search Tags: speech separation, multimodal fusion, dilated convolutional neural network, bidirectional long short-term memory network, U-Net