
Research On Audio-Visual Event Localization Method And Speech Enhancement Method

Posted on: 2023-06-11  Degree: Master  Type: Thesis
Country: China  Candidate: C Xue  Full Text: PDF
GTID: 2558307097994819  Subject: Computer technology
Abstract/Summary:
The goal of the audio-visual event localization task is to locate the temporal boundaries within which an event is both audible and visible, and to identify the class of that event. Previous audio-visual event localization methods mainly focus on temporal modeling of events with a simple fusion of audio and visual features. In natural scenes, however, a video records not only the events of interest but also ambient acoustic noise and visual background, leaving redundant information in the raw audio and visual features; direct fusion of the two feature streams therefore often causes false localization of events. To address this problem, this paper studies audio-visual event localization methods, aiming to achieve more accurate event localization by designing a new strategy for fusing audio-visual information.

In addition, existing methods, built on existing audio-visual event datasets, all take single-channel audio signals and visual signals as inputs. However, a time segment in a real scene may contain multiple audio and visual events, so directly using the original single-channel audio signal as the input of an audio-visual event localization algorithm makes it very challenging to model the audio signal of the target event. Intuitively, one way to handle this is to first use the spatial information of a multi-channel audio signal to enhance the signal of the target event, and then run the audio-visual event localization algorithm to achieve more accurate localization. Considering the complexity of this proposal, this paper first studies multi-channel speech enhancement methods, taking speech events as the research object. This research not only lays the foundation for subsequent in-depth study of audio-visual event localization on multi-channel audio signals, but can also improve speech quality in practical speech communication scenarios.

For multi-channel speech enhancement tasks, existing methods either do not consider the phase spectrum of the reconstructed target speech or cannot balance enhancement quality with real-time performance, and thus cannot enhance the target speech well in practical scenarios. To address these problems, this paper proposes a real-time multi-channel speech enhancement method based on complex-valued neural network masking to better enhance the target speech signal.

In summary, the research contents and innovations of this paper are as follows:

(1) For the audio-visual event localization task, this paper proposes a co-attention model that exploits the spatial and semantic correlations between audio and visual features, guiding the extraction of discriminative audio-visual features for better audio-visual event localization. Specifically, the proposed co-attention model consists of a co-spatial attention module and a co-semantic attention module, which model spatial and semantic correlations respectively. The co-attention model can be applied to a variety of event localization tasks, such as cross-modality localization and multimodal event localization. Experiments on a publicly available audio-visual event dataset show that, by learning both spatial and semantic co-attention, the proposed method achieves the best performance compared to existing methods.

(2) For the multi-channel speech enhancement task, this paper proposes a complex-valued mask estimation network combined with a complex attention model to reconstruct the target speech. The complex attention model captures correlations among the multi-channel signals in the feature-encoding stage, and focuses on the time-frequency bins to be reconstructed as target masks in the complex-valued mask-estimation stage. During the testing phase, differential beamforming techniques are additionally applied to further suppress noise. Experiments on public datasets show that the proposed method outperforms the baseline methods.
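The co-spatial attention idea in contribution (1) can be sketched as audio-guided pooling over visual region features. The snippet below is a minimal, hypothetical illustration, not the thesis's actual architecture: the function name, the dot-product scoring, and the single-segment setting are all assumptions made for clarity. An audio feature vector scores each visual region by dot product, and a softmax over those scores produces the weights used to pool the region features.

```python
import math

def audio_guided_spatial_attention(audio, visual):
    """Pool visual region features with audio-guided attention weights.

    audio:  length-d audio feature vector for one time segment.
    visual: list of k length-d region feature vectors (e.g. cells of a
            CNN feature map).
    Returns (pooled, weights): the attention-weighted sum of the region
    features, and the softmax weights themselves.
    """
    # Score each visual region by its dot product with the audio feature.
    scores = [sum(a * v for a, v in zip(audio, region)) for region in visual]

    # Numerically stable softmax over the region scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]

    # Attention-weighted sum of the region features.
    d = len(audio)
    pooled = [sum(w * region[i] for w, region in zip(weights, visual))
              for i in range(d)]
    return pooled, weights
```

In a trained model the attention would be learned jointly in both directions (visual features attending to audio as well); here the weights simply highlight regions whose features align with the audio vector, which is the mechanism that suppresses irrelevant visual background.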
Keywords/Search Tags:Audio-Visual Learning, Event Localization, Attention Model, Speech Enhancement, Deep Learning