Study On Generation Of Spatial Audio Using Audio-Visual Cues

Posted on: 2022-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Lv
Full Text: PDF
GTID: 2518306575467894
Subject: Information and Communication Engineering
Abstract/Summary:
Sound reproduction technologies aim to reproduce the properties of sound more accurately, and one of the most important of these properties is spatiality. Advanced spatial audio methods such as Ambisonics and binaural audio reproduce sound more immersively and naturally, but they are far less widespread than stereo because they demand professional expertise or costly equipment. In addition, as networks have become ubiquitous, multimedia such as video and audio has become a common means of communication, yet most audio in planar or panoramic videos is mono or stereo, which leads to a spatial mismatch between the audio and the visual experience.

As a possible solution, this thesis proposes a deep neural network that predicts spatial audio from video with mono audio. The network extracts directional sound sources with a Temporal Convolutional Network, fuses audio and visual features, and generates the target spatial audio content. The work focuses on two main scenarios: generating binaural audio from planar video and generating first-order Ambisonics from panoramic video.

For generating binaural audio from planar video, the neural network takes the time-frequency spectrum of the right channel, or of the sum of the binaural channels, as input, predicts a complex ideal ratio mask, and reconstructs more immersive binaural audio. Within the network, a primary ambient extraction module based on temporal convolution layers is proposed to focus on the directional components of the sound field, and an encoding module that uses the fused audio-visual features performs the binaural prediction. Experiments with this network show that both the binaural generation and the audio-visual fusion are effective.

For generating first-order Ambisonics audio from panoramic video, a neural network is proposed and implemented that predicts three complex ideal ratio masks, mapping the input mono signal to three channels of first-order Ambisonics. This network also relies on the primary ambient extraction module and on audio-visual fusion, and experiments indicate that the overall spatial experience of video with the predicted channels exceeds that of video with the mono input.
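The complex ratio masks mentioned in the abstract can be illustrated with a small sketch. It is not the thesis's network: it only shows, under stated assumptions, how a complex mask relates an input mono spectrum to a target channel spectrum by element-wise complex multiplication. The function names, the toy spectrogram size, and the random data are hypothetical and chosen for illustration; in practice a network would predict the masks rather than compute them from known targets.

```python
import numpy as np

def ideal_complex_ratio_mask(target_spec, input_spec, eps=1e-12):
    """Return a complex mask M such that M * input_spec is close to target_spec.

    M = target / input, written as target * conj(input) / |input|^2
    so that a small eps can guard against division by zero.
    """
    return target_spec * np.conj(input_spec) / (np.abs(input_spec) ** 2 + eps)

def apply_complex_mask(mask, input_spec):
    """Apply a complex mask to a spectrum by element-wise complex multiplication."""
    return mask * input_spec

# Toy data (assumption): random complex "spectrograms" stand in for real STFTs.
rng = np.random.default_rng(0)
n_freq, n_frames = 5, 4
mono_spec = (rng.standard_normal((n_freq, n_frames))
             + 1j * rng.standard_normal((n_freq, n_frames)))

# Three target channels, mirroring the three masks described for Ambisonics.
target_specs = (rng.standard_normal((3, n_freq, n_frames))
                + 1j * rng.standard_normal((3, n_freq, n_frames)))

# One mask per target channel; the mono spectrum broadcasts across channels.
masks = ideal_complex_ratio_mask(target_specs, mono_spec)
reconstructed = apply_complex_mask(masks, mono_spec)
```

Because the mask is complex, it can change both the magnitude and the phase of each time-frequency bin, which is what lets a single mono spectrum be mapped to several different channels.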
Keywords/Search Tags:spatial audio, audio-visual, primary ambient extraction