Research On Audio-visual Speech Separation

Posted on: 2021-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: C D Li
Full Text: PDF
GTID: 2518306503991059
Subject: Computer technology
Abstract/Summary:
Humans can track and distinguish the speech of any target speaker in a complex environment where multiple speakers talk simultaneously. The problem of building an auditory model that gives intelligent machines a similar capability is known as the cocktail party problem, and speech separation is one of the key technologies for solving it. In recent years, with the development of deep learning, speech separation methods based on deep learning have made significant progress. However, most studies use only the audio information available in real scenes, and information from other modalities has not been used effectively. From the perspective of multi-modal fusion, this thesis explores how to incorporate the visual information present in real scenes into a speech separation system to improve its performance. First, we design an audio-visual speech separation system that extracts the visual information of the target speakers and incorporates it into the speech separation task. Second, we explore different ways of incorporating visual information and develop an attention-based mechanism that makes better use of it. Furthermore, we design an approach that extracts speaker contextual information directly from the mixed audio and the target speakers' visual information. Integrating the contextual information of the target speaker into the speech separation system yields a further performance improvement. The experiments are carried out on the LRS2 and VoxCeleb2 audio-visual datasets, and the proposed methods are systematically verified. The experimental results show that, compared with the baseline system, the proposed methods achieve significant and consistent performance improvements.
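As a rough illustration of what attention-based incorporation of visual information can look like, the sketch below lets each audio frame attend over the visual frames of the target speaker using scaled dot-product attention, then concatenates the attended visual context to the audio feature. This is a generic technique chosen for illustration, not the architecture used in the thesis; the function names (`attend_and_fuse`, `softmax`, `dot`), the feature dimension, and the random toy features are all assumptions.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend_and_fuse(audio, visual):
    """Hypothetical fusion step (not the thesis's method).

    audio:  list of T_a audio frame embeddings, each of length d
    visual: list of T_v visual frame embeddings, each of length d
    Returns (fused, weights): fused has T_a rows of length 2*d,
    weights has T_a rows of length T_v, each row summing to 1.
    """
    d = len(audio[0])
    fused, weights = [], []
    for a in audio:
        # Audio frame acts as the query; visual frames are keys and values.
        w = softmax([dot(a, v) / math.sqrt(d) for v in visual])
        # Attention-weighted average of the visual frames.
        context = [sum(w_j * v[k] for w_j, v in zip(w, visual)) for k in range(d)]
        # Concatenate the audio feature with its visual context.
        fused.append(list(a) + context)
        weights.append(w)
    return fused, weights


# Toy features standing in for real encoder outputs (assumed shapes).
random.seed(0)
D = 4
audio = [[random.gauss(0, 1) for _ in range(D)] for _ in range(8)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(2)]

fused, weights = attend_and_fuse(audio, visual)
print(len(fused), len(fused[0]))
```

In a full system the fused features would feed a separation network that estimates the target speaker's signal; that stage is omitted here because the abstract does not specify it.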
Keywords/Search Tags: speech separation, audio-visual, multi-modal, cocktail party problem