
Audio-visual Understanding With Self-supervised Learning

Posted on: 2024-02-20    Degree: Master    Type: Thesis
Country: China    Candidate: H S Wang    Full Text: PDF
GTID: 2568307079459454    Subject: Computer Science and Technology
Abstract/Summary:
The development of deep learning and artificial intelligence has profoundly influenced human society and become deeply embedded in people's daily lives. Although vision is favored by researchers in artificial intelligence and computer vision, audio remains a vital channel through which we learn about and understand the real world. Vision and audio are the most direct ways for people to perceive the world, and researchers have correspondingly begun to rely on both to explore and improve the perceptual ability of machines. The comprehensive use of multimodal information often provides richer and complementary cues, breaking through the performance limits of machines that use only single-modality information and enhancing their problem-solving ability. However, visual features and audio features belong to two different modalities, and achieving cross-modal matching and alignment between them is a key and difficult problem.

To deepen the understanding of audio-visual multimodal learning and to explore the matching and alignment relationship between the audio and visual modalities, this thesis takes audio-visual sound source separation and audio inpainting as its concrete tasks, and uses deep learning networks trained in a self-supervised manner to solve the complex problems that arise in these audio-visual multimodal tasks.

1) The objective of audio-visual sound source separation is to separate the sound produced by a given visual object from a mixed audio signal. To address the cross-modal heterogeneity caused by differences in modality distributions, and the fact that separation accuracy depends heavily on object detection, this thesis designs a category-guided sound source separation model that comprehensively exploits three kinds of modal information to achieve more accurate cross-modal matching and improve separation quality.

2) To overcome the insufficient ability to separate real-world mixtures that results from the commonly used "mix-and-separate" training scheme, a partially supervised audio-visual sound source separation model is proposed, which brings both real-world and artificial mixtures into training in a partially supervised way, improving the model's generalization to real-world mixture separation.

3) The goal of audio inpainting is to restore the missing parts of an audio signal under the guidance of the corresponding visual information. The task is challenging because of cross-modal heterogeneity and the difficulty of inpainting long audio segments. This thesis designs a long-duration audio inpainting network based on iterative feature reasoning, which improves the model's ability to repair long audio segments through iterative refinement.

The significance of this work lies in better understanding the relationship between the audio and visual modalities, exploring how the two can be combined, and improving machines' perception of the real world. In practical application scenarios, the work of this thesis can serve many needs, such as instrument ensemble teaching, speech recognition with multiple sound sources, enhancement and denoising in voice editing, human-computer interaction, sound source localization, audio inpainting and enhancement, and comprehensive evaluation of unconstrained video.
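The "mix-and-separate" training scheme mentioned above can be illustrated with a minimal numpy sketch (not the thesis's actual model): two single-source magnitude spectrograms are summed into an artificial mixture, and each source's ideal ratio mask becomes a free self-supervised target that a separation network could be trained to predict. The function name and the use of ratio masks here are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Build a self-supervised training pair in the mix-and-separate style:
    the artificial mixture is the network input, and the ideal ratio
    masks of the two known sources are the free supervision targets."""
    mixture = spec_a + spec_b
    mask_a = spec_a / (mixture + eps)   # ideal ratio mask for source A
    mask_b = spec_b / (mixture + eps)   # ideal ratio mask for source B
    return mixture, mask_a, mask_b

# toy single-source magnitude spectrograms (freq bins x time frames)
rng = np.random.default_rng(0)
a = rng.random((4, 5))
b = rng.random((4, 5))
mix, m_a, m_b = mix_and_separate_targets(a, b)

# applying an ideal mask to the mixture recovers the corresponding source
assert np.allclose(m_a * mix, a, atol=1e-5)
```

Because the clean sources are known before mixing, no manual labels are needed — which is also why, as noted above, a model trained only on such artificial mixtures may generalize poorly to real-world mixtures.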
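The intuition behind iterative reasoning for long gaps can likewise be sketched in a toy form. The code below is an assumption-laden stand-in for the thesis's network: missing time-frequency bins are repeatedly re-estimated from their temporal neighbours, so information from the known context propagates further into the gap with each iteration. A real model would replace the neighbour-averaging step with learned feature reasoning.

```python
import numpy as np

def iterative_inpaint(spec, mask, n_iters=50):
    """Toy iterative refinement for inpainting: bins where mask == 0
    are missing. Each pass re-estimates the missing bins from the
    average of their temporal neighbours, while known bins stay fixed,
    so the fill quality of long gaps improves with more iterations."""
    filled = spec * mask  # zero out the missing region
    for _ in range(n_iters):
        # neighbour average along the time axis (a simple smoothing pass)
        padded = np.pad(filled, ((0, 0), (1, 1)), mode="edge")
        neighbour_avg = 0.5 * (padded[:, :-2] + padded[:, 2:])
        # update only the missing bins; keep the observed bins unchanged
        filled = spec * mask + neighbour_avg * (1 - mask)
    return filled
```

On a constant spectrogram with a gap of several frames, the estimate inside the gap converges toward the surrounding value as iterations accumulate — a single pass would only fill the gap's edges, which mirrors why one-shot prediction struggles with long audio gaps.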
Keywords/Search Tags: Audio-visual sound source separation, Audio inpainting, Cross-modal matching, Self-supervised learning