Research On Audio-visual Cross-modal Sound Source Separation

Posted on: 2022-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: S Ma
Full Text: PDF
GTID: 2518306764966879
Subject: Computer Software and Application of Computer
Abstract/Summary:
Sound source separation is one of the older tasks in the audio field; when first proposed, it was given the elegant name "cocktail party problem." The problem was originally posed to separate human voices in complex scenes. Later, as the problem grew more complex and application scenarios changed, it was gradually divided into branch tasks such as speech separation, vocal and accompaniment separation, and music source separation. Early approaches looked to hardware, for example setting up a microphone array to increase the number of sound channels and then applying signal processing methods to separate the different voices. Later, with the rapid development of neural networks and deep learning, researchers designed network frameworks to perform sound source separation. As research deepened, however, achieving separation by designing ever more complex networks gradually reached a performance bottleneck. This is because, as the demand for separation quality rises, the interaction and overlap between different sound sources in the time-frequency domain must be taken into account, and this overlap is particularly severe in music source separation. At the same time, given the success of deep learning in other fields such as computer vision, researchers have tried to extend the data used from a single modality (audio) to multi-modal data. By establishing the correlation and correspondence between audio and visual information, cross-modal methods can break through the performance bottleneck faced by audio-only methods. Using cross-modal information for sound source separation poses two challenges: one is to achieve accurate correspondence between information from different modalities, and the other is to obtain high-quality separation results. This thesis addresses these challenges and makes the following contributions:

1. For the correspondence of audio and visual cross-modal information, pre-detecting and pre-screening the sound sources in the visual frames excludes irrelevant data when information from different modalities is matched, so that the matching is more accurate. Building on the Faster R-CNN object detection method, this thesis detects hand position and orientation in the video frames and re-screens the object detection results, eliminating the negative impact of wrong detections when the object detection model generalizes poorly to the data. The proposed screening scheme is verified experimentally, which demonstrates its effectiveness.

2. For the separation result, this thesis proposes the concept of residual information. By combining residual information with visual information, the separation result is cyclically refined over multiple rounds, so that it is supplemented where information is insufficient and reduced where information overflows. In this way, the final separation result is closer to the ideal true separation result. This thesis conducts a complete experimental verification of the residual information settings and the cyclic refinement scheme, which demonstrates the effectiveness of the algorithm.

3. The algorithm is verified on several large-scale datasets and compared with the latest methods in the field. In these comparisons it achieves the best results, which demonstrates the effectiveness of the proposed algorithm and its excellent generalization performance.
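The abstract does not say how residual information is computed or how it is combined with visual features. One plausible reading, assumed here purely for illustration, is that the residual is the part of the mixture spectrogram that the current source estimates fail to explain: positive where information is missing, negative where it overflows. In the minimal sketch below, the function name `refine_cyclically`, the fixed `weights` array (a stand-in for a visually guided network), and the `step` size are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def refine_cyclically(mixture, estimates, weights, num_rounds=3, step=0.5):
    """Illustrative cyclic refinement driven by a residual (an assumption,
    not the thesis's actual algorithm).

    mixture:   (F, T) magnitude spectrogram of the mixed audio
    estimates: (K, F, T) initial spectrogram estimate for each of K sources
    weights:   (K, F, T) nonnegative shares summing to 1 over K; hypothetical
               stand-in for a network conditioned on visual information
    """
    estimates = estimates.copy()  # leave the caller's array untouched
    for _ in range(num_rounds):
        # Residual > 0: the estimates explain too little of the mixture.
        # Residual < 0: the estimates overflow the mixture.
        residual = mixture - estimates.sum(axis=0)
        # Give each source its share of a fraction of the residual, and
        # keep magnitudes nonnegative.
        estimates = np.clip(estimates + step * weights * residual, 0.0, None)
    return estimates


if __name__ == "__main__":
    mixture = np.full((2, 3), 4.0)
    initial = np.stack([np.full((2, 3), 1.0), np.full((2, 3), 2.0)])
    shares = np.full((2, 2, 3), 0.5)
    refined = refine_cyclically(mixture, initial, shares)
    print(np.abs(mixture - refined.sum(axis=0)).max())
```

With `step` below 1 and no clipping triggered, each round shrinks the residual by the factor `1 - step`, which mirrors the idea of refining the result over several passes rather than correcting it in one shot.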
Keywords/Search Tags: Sound Source Separation, Audio-visual Sound Source Separation, Cross-modal Learning, Self-supervised, Object Detection