
Research On Speech Separation Based On Visual Assistance

Posted on: 2023-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: X L Wang
Full Text: PDF
GTID: 2568307043985619
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Speech separation aims to recover clean target speech signals from mixed audio containing multiple interfering sources. With the parallel development of speech recognition and related technologies, speech separation is now widely used in film post-production, news broadcasting, and intelligent translation. Research has shown that multi-modal information about the target speaker in video, such as facial appearance and head-pose information, can effectively improve separation performance.

However, existing multi-modal speech separation methods pursue diversity of auxiliary information at the expense of the deep correlation between any single auxiliary cue and the speech signal. Separation models that use visual and other auxiliary information depend too heavily on vision, so their robustness is low when visual information is incomplete, a common situation in practice. Moreover, when visual information is lost entirely, the target speech produced by audio-only separation is of significantly lower quality than that produced by multi-modal separation.

Addressing these problems, this thesis carries out in-depth theoretical research and experimental validation on both audio-only and multi-modal speech separation. The main contributions are as follows:

1) For the case where visual information is available as an auxiliary cue, a speech separation network based on deep audio-visual correlation is proposed. Before the visual features assist separation, the target speech is fed into the visual feature extraction network, and the deep correlation between audio and video is exploited to further refine the visual features, achieving bidirectional enhancement of vision and speech and thereby better assisting separation. Qualitative and quantitative results on public datasets show that the model significantly improves separation quality, and its performance metrics surpass those of existing methods.

2) A self-built continuous dataset, SCD, exploits factors such as the target speaker's speaking habits to strengthen feature extraction and fusion, deeply mine the temporal correlation between audio and video, and, in combination with the network above, improve separation performance. At the same time, to reduce the model's over-dependence on visual information, the input images are randomly blurred with a mean filter during training, which improves the model's robustness. Experiments show that, compared with its results on public datasets, the audio-visual correlation separation network performs better on SCD, and the model is indeed highly robust.

3) For the complete loss of visual information, a speech separation network based on synthetic-face assistance is proposed. The network drives face synthesis from speech, artificially equipping the speech signal with corresponding visual features as auxiliary information, thereby improving the output quality of audio-only separation. On this basis, the non-training segments of the target speech are fully exploited as a self-reference verification of identity information, which optimizes the network and effectively increases the model's separation speed. Experiments show that, compared with audio-only separation, the synthetic face significantly and effectively improves separation performance, while self-reference verification increases the speed of separation.
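The mean-filter blurring described in contribution 2) can be illustrated with a minimal sketch. This is an assumed, simplified version of the augmentation, not the thesis's actual implementation: images are treated as 2D grayscale arrays (nested lists), and the kernel size `k` is an illustrative choice.

```python
def mean_filter(image, k=3):
    """Blur a 2D grayscale image (list of lists) with a k x k box filter.

    Near the borders, the average is taken over only the part of the
    window that falls inside the image, so edges are not padded.
    """
    h, w = len(image), len(image[0])
    r = k // 2  # window radius
    blurred = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total, count = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        total += image[ni][nj]
                        count += 1
            blurred[i][j] = total / count
    return blurred
```

During training, such a filter would be applied at random to the speaker's face or lip-region frames, so the separation network cannot over-fit to crisp visual cues and degrades gracefully when real-world visual input is blurry or partially missing.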
Keywords/Search Tags:Speech separation, Speech enhancement, Audio visual relevance, Face synthesis, Image feature extraction