
Research On On/Off-Screen Speech Separation Algorithm Based On Multimodal Fusion

Posted on: 2022-09-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y Yang    Full Text: PDF
GTID: 2518306311491524    Subject: Control Science and Engineering

Abstract/Summary:
Speech is one of the most important ways for humans to transmit information, but in reality speech signals are often disturbed by other voices or environmental noise. Speech separation is therefore one of the most important research directions in signal processing. It originates from the "cocktail party problem" and is often used as a pre-processing step for other speech applications, making it important for automatic speech recognition, hearing-aid development, and human-computer interaction. Traditional speech separation focuses on processing the single-modality speech signal. With the development of multimedia applications and information technology, however, the speech signal and the speaker's video signal are now often processed together. Because the video signal is not affected by the acoustic environment, and the speaker's visual information, such as facial expression and lip movement, is strongly correlated with the speech signal, speech separation based on multimodal fusion has become a new research trend.

In scenarios such as simultaneous interpretation and reporter interviews, the video usually contains only one speaker, while the audio is a mixture of the speech of two or more speakers. The speech related to the speaker shown in the video is called on-screen speech, and the speech unrelated to that speaker is called off-screen speech. Building on an extensive review of the literature and of existing schemes, this thesis proposes two on/off-screen speech separation algorithms based on multimodal fusion. The main contributions are as follows:

(1) This thesis proposes an on/off-screen speech separation algorithm based on audio-video fusion and residual completion. An on-screen U-Net processes the mixed-speech spectrogram and predicts the on-screen speech spectrogram end to end. At the same time, a CNN-based audio-video feature fusion network extracts and fuses audio and video features, and the fused features are combined with the on-screen U-Net so that the audio-video information assists the spectrogram prediction (see the first sketch after the abstract). To improve off-screen separation, the algorithm also introduces a residual U-Net, which completes the off-screen speech spectrogram by generating a residual spectrogram and removes the residual caused by the superposition and mutual interference of the on- and off-screen speech. The algorithm was evaluated on the VoxCeleb2 dataset and shown to be effective and reliable.

(2) This thesis also proposes an audio-video fusion on/off-screen speech separation algorithm that incorporates motion information. A spectrogram-prediction U-Net processes the mixed-speech spectrogram and predicts the on-screen and off-screen speech spectrograms separately, while an image-optical flow fusion network combining CNN and BLSTM extracts and fuses features from the lip image and the motion information; the fused lip image-optical flow feature is fed into the prediction U-Net to assist the spectrogram prediction (see the second sketch after the abstract). On the basis of the CNN-based multimodal feature fusion network, this algorithm introduces a BLSTM to extract the temporal features in the image and motion information. It does not use the complete video; instead it uses only the grayscale image and optical flow of the lip region, making full use of the close correlation between the lip-region video signal and the speech signal, and preserving key information while eliminating redundant information. The algorithm was evaluated on the VoxCeleb2 dataset to verify its effectiveness on the on/off-screen speech separation task and its robustness under different conditions, and the influence of each component of the architecture on the overall result was also studied.
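To make the pipeline in (1) concrete, the following PyTorch sketch shows one possible reading of the architecture: an on-screen U-Net conditioned on CNN-fused audio-visual features, plus a residual U-Net that completes the off-screen spectrogram. All module names, layer sizes, the broadcast-style fusion, and the exact residual-completion rule are illustrative assumptions, not the thesis's actual implementation or hyperparameters.

```python
# Illustrative sketch only: the fusion rule, layer sizes, and residual completion
# are assumptions, not the thesis's implementation.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net operating on magnitude spectrograms [B, C, F, T]."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.enc1 = conv_block(c_in, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = conv_block(32 + 16, 16)
        self.out = nn.Conv2d(16, c_out, 1)

    def forward(self, x):
        e1 = self.enc1(x)                              # [B, 16, F, T]
        e2 = self.enc2(self.pool(e1))                  # [B, 32, F/2, T/2]
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d1)

class AVFusion(nn.Module):
    """CNN audio-visual fusion producing a per-(F, T) conditioning map."""
    def __init__(self, vis_dim=512):
        super().__init__()
        self.audio_cnn = conv_block(1, 16)
        self.vis_proj = nn.Linear(vis_dim, 16)         # project face/lip embedding

    def forward(self, mix_spec, vis_feat):
        a = self.audio_cnn(mix_spec)                   # [B, 16, F, T]
        v = self.vis_proj(vis_feat)[:, :, None, None]  # [B, 16, 1, 1]
        return a + v                                   # broadcast fusion

class OnOffScreenSeparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fusion = AVFusion()
        self.on_unet = TinyUNet(c_in=1 + 16, c_out=1)  # mixture + fused feature
        self.res_unet = TinyUNet(c_in=2, c_out=1)      # mixture + on-screen estimate

    def forward(self, mix_spec, vis_feat):
        fused = self.fusion(mix_spec, vis_feat)
        on_spec = self.on_unet(torch.cat([mix_spec, fused], dim=1))
        # The residual U-Net corrects the crude subtraction estimate of the
        # off-screen spectrogram (one possible reading of "residual completion").
        residual = self.res_unet(torch.cat([mix_spec, on_spec], dim=1))
        off_spec = torch.relu(mix_spec - on_spec) + residual
        return on_spec, off_spec

mix = torch.randn(2, 1, 256, 128).abs()   # toy magnitude spectrograms
vis = torch.randn(2, 512)                 # toy visual embeddings
on, off = OnOffScreenSeparator()(mix, vis)
print(on.shape, off.shape)                # torch.Size([2, 1, 256, 128]) twice
```

In this sketch the off-screen estimate is formed by subtracting the on-screen estimate from the mixture and adding a learned residual correction; the thesis's exact completion scheme may differ.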
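Similarly, the lip image-optical flow fusion branch in (2) can be sketched as a shared CNN over stacked grayscale lip crops and optical-flow frames followed by a BLSTM over time. Input resolutions, channel widths, and the way the flow frames are stacked with the lip image are assumptions for illustration only.

```python
# Illustrative sketch only: sizes and the image/flow stacking are assumptions.
import torch
import torch.nn as nn

class LipFlowFusion(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # 1-channel grayscale lip crop + 2-channel optical flow (dx, dy) = 3 channels
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.blstm = nn.LSTM(64, feat_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, lip_gray, flow):
        # lip_gray: [B, T, 1, H, W], flow: [B, T, 2, H, W]
        b, t = lip_gray.shape[:2]
        x = torch.cat([lip_gray, flow], dim=2).flatten(0, 1)  # [B*T, 3, H, W]
        f = self.cnn(x).flatten(1).view(b, t, -1)             # [B, T, 64]
        out, _ = self.blstm(f)                                 # [B, T, feat_dim]
        return out  # temporal feature fed to the prediction U-Net

feats = LipFlowFusion()(torch.rand(2, 25, 1, 64, 64), torch.rand(2, 25, 2, 64, 64))
print(feats.shape)   # torch.Size([2, 25, 128])
```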
Keywords/Search Tags:Speech separation, Multimodal fusion, Deep learning, Convolutional neural network, U-Net