
Research On On/Off-Screen Speech Separation Algorithm Based On Multimodal Fusion

Posted on: 2022-09-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y Yang    Full Text: PDF
GTID: 2518306311491524    Subject: Control Science and Engineering

Abstract/Summary:
Speech is one of the most important ways for humans to transmit information, but in reality speech signals are often disturbed by other voices or environmental noise. Speech separation is therefore one of the most important research directions in signal processing. It originates from the "cocktail party problem" and is often used as a pre-processing step for other speech applications, making it important for automatic speech recognition, hearing-aid development, and human-computer interaction. Traditional speech separation focuses on processing the single-modality speech signal. With the development of multimedia applications and information technology, however, the speech signal and the speaker's video signal are now often processed together. Because the video signal is not affected by the acoustic environment, and the speaker's visual information, such as facial expression and lip movement, is strongly correlated with the speech signal, speech separation based on multimodal fusion has become a new research trend.

In scenarios such as simultaneous interpretation and reporter interviews, the video usually contains only one speaker, while the audio is a mixture of the speech of two or more speakers. The speech related to the speaker shown in the video is called on-screen speech, and the speech unrelated to that speaker is called off-screen speech. Building on an extensive review of the literature and of existing schemes, this thesis proposes two on/off-screen speech separation algorithms based on multimodal fusion. The main contributions are as follows:

(1) This thesis proposes an on/off-screen speech separation algorithm based on audio-video fusion and residual completion. An on-screen U-Net processes the mixed-speech spectrogram and predicts the on-screen speech spectrogram end to end. At the same time, a CNN-based audio-video feature fusion network extracts and fuses audio and video features, and the fused features are combined with the on-screen U-Net so that the audio-video information assists the spectrogram prediction (see the first sketch after the abstract). To improve off-screen separation, the algorithm also introduces a residual U-Net, which completes the off-screen speech spectrogram by generating a residual spectrogram and removes the residual caused by the superposition and mutual interference of the on- and off-screen speech. The algorithm was evaluated on the VoxCeleb2 dataset and shown to be effective and reliable.

(2) This thesis also proposes an audio-video fusion on/off-screen speech separation algorithm that incorporates motion information. A spectrogram-prediction U-Net processes the mixed-speech spectrogram and predicts the on-screen and off-screen speech spectrograms separately, while an image-optical flow fusion network combining CNN and BLSTM extracts and fuses features from the lip image and the motion information; the fused lip image-optical flow feature is fed into the prediction U-Net to assist the spectrogram prediction (see the second sketch after the abstract). On the basis of the CNN-based multimodal feature fusion network, this algorithm introduces a BLSTM to extract the temporal features in the image and motion information. It does not use the complete video; instead it uses only the grayscale image and optical flow of the lip region, making full use of the close correlation between the lip-region video signal and the speech signal, and preserving key information while eliminating redundant information. The algorithm was evaluated on the VoxCeleb2 dataset to verify its effectiveness on the on/off-screen speech separation task and its robustness under different conditions, and the influence of each component of the architecture on the overall result was also studied.
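To make the pipeline in (1) concrete, the following PyTorch sketch shows one possible reading of the architecture: an on-screen U-Net conditioned on CNN-fused audio-visual features, plus a residual U-Net that completes the off-screen spectrogram. All module names, layer sizes, the broadcast-style fusion, and the exact residual-completion rule are illustrative assumptions, not the thesis's actual implementation or hyperparameters.

```python
# Illustrative sketch only: the fusion rule, layer sizes, and residual completion
# are assumptions, not the thesis's implementation.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net operating on magnitude spectrograms [B, C, F, T]."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.enc1 = conv_block(c_in, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = conv_block(32 + 16, 16)
        self.out = nn.Conv2d(16, c_out, 1)

    def forward(self, x):
        e1 = self.enc1(x)                              # [B, 16, F, T]
        e2 = self.enc2(self.pool(e1))                  # [B, 32, F/2, T/2]
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d1)

class AVFusion(nn.Module):
    """CNN audio-visual fusion producing a per-(F, T) conditioning map."""
    def __init__(self, vis_dim=512):
        super().__init__()
        self.audio_cnn = conv_block(1, 16)
        self.vis_proj = nn.Linear(vis_dim, 16)         # project face/lip embedding

    def forward(self, mix_spec, vis_feat):
        a = self.audio_cnn(mix_spec)                   # [B, 16, F, T]
        v = self.vis_proj(vis_feat)[:, :, None, None]  # [B, 16, 1, 1]
        return a + v                                   # broadcast fusion

class OnOffScreenSeparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fusion = AVFusion()
        self.on_unet = TinyUNet(c_in=1 + 16, c_out=1)  # mixture + fused feature
        self.res_unet = TinyUNet(c_in=2, c_out=1)      # mixture + on-screen estimate

    def forward(self, mix_spec, vis_feat):
        fused = self.fusion(mix_spec, vis_feat)
        on_spec = self.on_unet(torch.cat([mix_spec, fused], dim=1))
        # The residual U-Net corrects the crude subtraction estimate of the
        # off-screen spectrogram (one possible reading of "residual completion").
        residual = self.res_unet(torch.cat([mix_spec, on_spec], dim=1))
        off_spec = torch.relu(mix_spec - on_spec) + residual
        return on_spec, off_spec

mix = torch.randn(2, 1, 256, 128).abs()   # toy magnitude spectrograms
vis = torch.randn(2, 512)                 # toy visual embeddings
on, off = OnOffScreenSeparator()(mix, vis)
print(on.shape, off.shape)                # torch.Size([2, 1, 256, 128]) twice
```

In this sketch the off-screen estimate is formed by subtracting the on-screen estimate from the mixture and adding a learned residual correction; the thesis's exact completion scheme may differ.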
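Similarly, the lip image-optical flow fusion branch in (2) can be sketched as a shared CNN over stacked grayscale lip crops and optical-flow frames followed by a BLSTM over time. Input resolutions, channel widths, and the way the flow frames are stacked with the lip image are assumptions for illustration only.

```python
# Illustrative sketch only: sizes and the image/flow stacking are assumptions.
import torch
import torch.nn as nn

class LipFlowFusion(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # 1-channel grayscale lip crop + 2-channel optical flow (dx, dy) = 3 channels
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.blstm = nn.LSTM(64, feat_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, lip_gray, flow):
        # lip_gray: [B, T, 1, H, W], flow: [B, T, 2, H, W]
        b, t = lip_gray.shape[:2]
        x = torch.cat([lip_gray, flow], dim=2).flatten(0, 1)  # [B*T, 3, H, W]
        f = self.cnn(x).flatten(1).view(b, t, -1)             # [B, T, 64]
        out, _ = self.blstm(f)                                 # [B, T, feat_dim]
        return out  # temporal feature fed to the prediction U-Net

feats = LipFlowFusion()(torch.rand(2, 25, 1, 64, 64), torch.rand(2, 25, 2, 64, 64))
print(feats.shape)   # torch.Size([2, 25, 128])
```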
Keywords/Search Tags:Speech separation, Multimodal fusion, Deep learning, Convolutional neural network, U-Net