Study On Generation Of Spatial Audio Using Audio-Visual Cues

Posted on:2022-08-23

Degree:Master

Type:Thesis

Country:China

Candidate:Z L Lv

GTID:2518306575467894

Subject:Information and Communication Engineering

Abstract/Summary:

Sound reproduction technologies aim to reproduce the properties of sound ever more accurately, and one of the most important of these properties is spatiality. Advanced spatial audio methods such as Ambisonics and binaural audio reproduce sound more immersively and naturally, but they are far less widespread than stereo because they demand specialized expertise or are costly. In addition, with the spread of the Internet, multimedia such as video and audio has become a common means of communication, yet most of the audio in flat (conventional) or panoramic videos is mono or stereo, which leads to a spatial mismatch between the audio and the visual experience. As a possible solution, this thesis proposes a deep neural network that predicts spatial audio from a video with mono audio. The network extracts directional sound sources with a Temporal Convolutional Network, fuses audio and visual features, and generates the target spatial audio content. The work focuses on two main scenarios: generating binaural audio from flat video, and generating first-order Ambisonics from panoramic video. For binaural generation from flat video, the network takes the time-frequency spectrum of the right channel or of the sum of the binaural channels as input, predicts a complex ideal ratio mask, and reconstructs more immersive binaural audio. Within the network, a primary-ambient extraction module based on temporal convolution layers is proposed to focus on the directional components of the sound field, and an encoding module that uses the fused audio-visual features performs the binaural prediction. Experiments with this network demonstrate the effectiveness of both the binaural audio generation and the audio-visual fusion. For first-order Ambisonics generation from panoramic video, a neural network that predicts three complex ideal ratio masks, mapping the input mono signal to three channels of first-order Ambisonics, is proposed and implemented. This network also uses the primary-ambient extraction module and audio-visual fusion, and the results show that the overall spatial experience of a video with the predicted channels exceeds that of the video with the original mono input.
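The mask-based reconstruction mentioned in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: it assumes that the network input is the sum of the two binaural channels and that the complex mask maps this mono spectrum to the left-minus-right difference spectrum, which is one common formulation of mono-to-binaural prediction. The function names (`apply_complex_mask`, `ideal_complex_mask`, `binaural_from_difference`) are hypothetical, and random complex arrays stand in for real STFT spectra.

```python
import numpy as np


def apply_complex_mask(mono_spec, mask):
    """Multiply a complex spectrogram element-wise by a complex ratio mask."""
    return mask * mono_spec


def ideal_complex_mask(target_spec, mono_spec, eps=1e-12):
    """Ideal complex ratio mask M such that M * mono_spec approximates target_spec.

    Written as target * conj(mono) / |mono|^2 so that eps can guard
    against division by zero in silent time-frequency bins.
    """
    return target_spec * np.conj(mono_spec) / (np.abs(mono_spec) ** 2 + eps)


def binaural_from_difference(mono_spec, diff_spec):
    """Recover left/right spectra from the sum (mono) and difference spectra.

    Assumption for this sketch: mono = left + right, diff = left - right.
    """
    left = (mono_spec + diff_spec) / 2
    right = (mono_spec - diff_spec) / 2
    return left, right


# Stand-in data: random complex "spectrograms" (frequency bins x frames).
rng = np.random.default_rng(0)
shape = (257, 64)
left_true = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
right_true = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

mono = left_true + right_true   # what the network would receive
diff = left_true - right_true   # what the mask should produce from mono

# In the thesis the mask is predicted by the network; here the ideal
# mask is computed from the ground truth to show the reconstruction step.
mask = ideal_complex_mask(diff, mono)
left_est, right_est = binaural_from_difference(mono, apply_complex_mask(mono, mask))
```

Under the same assumptions, the first-order Ambisonics case would apply three such masks to the mono spectrum, one per predicted channel, instead of a single mask for the difference signal.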

Keywords/Search Tags:

spatial audio, audio-visual, primary ambient extraction

Related items

1	Research Of The Primary-Ambient Extraction For 3D Audio
2	Primary And Ambient Components Extraction For Audio Scene Reproduction
3	Research On Algorithm Of Audio-visual Event Recognition And Sound Source Localization Based On Audio-visual Fusion
4	The use of ambient audio to increase safety and immersion in location-based games
5	Study On The Development Of China's Audio-visual New Media
6	The Research Of Evoked EEG Feature By Audio-visual Stimulus
7	Research On Feature Extraction And Fusion Of Audio Visual Information
8	Research On Semantic Analysis And Understanding Of Multimodal Video
9	Research On Intelligent Audio Detection And Enhancement Method In Strong Noise Background
10	Research On Active Speech Signal Detection Technology In Audio Monitoring System