
Research On Audio Visual Fusion Speech Separation Method For Multi-person Dialogue Robot

Posted on: 2022-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: D B Liu
Full Text: PDF
GTID: 2518306491492144
Subject: Control Engineering
Abstract/Summary:
In multi-person human-robot speech interaction, separating the speech of multiple target speakers from the mixture is a prerequisite for speech recognition and dialogue. Speech separation, also known as the "cocktail party" problem, is the task of recovering each speaker's speech from a mixed signal containing multiple sources. While human listeners can easily extract a target speaker's speech, the problem has remained a major challenge for machines over the past few decades. Single-channel speech separation is especially difficult: the number of microphones is smaller than the number of speakers and the speakers' signals overlap, so traditional separation methods rely on strong conditional assumptions and hand-designed rules, and their separation performance has hit a bottleneck.

With the advent of data-driven speech separation methods based on deep clustering and permutation invariant training, the separation process has been recast as a deep learning problem, resolving the label-permutation ambiguity and greatly improving separation quality. Audio-only methods, however, fail to exploit the speaker's visual information, and stable multi-speaker separation is difficult to achieve even with a microphone array. Speech separation methods based on audio-visual multimodal fusion use the speaker's lip-motion information to assist the separation process and can ensure separation performance, but existing time-frequency-domain audio-visual methods suffer from large model size and high computational complexity in the audio stream. To address these problems, this thesis proposes a single-channel speech separation method based on time-domain audio-visual fusion, in which the auditory and visual modalities of the separation system are jointly modeled with a time-domain encoder-decoder structure, improving separation performance.

The main research content of this thesis includes:

1) A speech separation method based on time-domain convolution. In the time-domain audio-only method, a neural network extracts a feature sequence from the speech signal, and a single-modality separation model is built to separate the mixed speech. The structural parameters of the time-domain model are then optimized and validated through comparative experiments.

2) A multimodal speech separation method based on time-domain audio-visual fusion. This thesis studies how to extract a better visual feature sequence for the target speakers and uses a recurrent neural network to fuse the speakers' visual and audio feature sequences. A better separation network is also studied to improve separation accuracy. The effectiveness of the feature extraction and separation models is verified through comparative experiments on public datasets.

3) Design and implementation of a speech separation system. A software framework comprising data preprocessing, speech separation, and other modules is designed, and a complete speech separation software system is implemented that collects image and speech signals and separates speech using the algorithm proposed in this thesis.

In summary, this thesis studies a single-channel speech separation method based on time-domain audio-visual fusion and conducts comparative experiments with audio-only and audio-visual separation models on audio-visual datasets. The experimental results show that the proposed time-domain audio-visual fusion method improves speech separation performance, which is of practical significance for deploying dialogue robots in multi-party conversation scenarios.
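Permutation invariant training, mentioned above, resolves the label-permutation ambiguity by evaluating the training loss under every assignment of model outputs to reference speakers and keeping the minimum. A minimal NumPy sketch (the function name `pit_mse` and the use of MSE as the criterion are illustrative assumptions; real systems often use SI-SNR instead):

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Utterance-level permutation invariant training loss: compute the
    MSE for every speaker permutation and return the minimum, together
    with the minimizing permutation."""
    n = len(estimates)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                        for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two toy reference signals; the "estimates" come back in swapped order.
targets = np.random.RandomState(0).randn(2, 100)
estimates = targets[::-1]
loss, perm = pit_mse(estimates, targets)
print(perm)  # → (1, 0): PIT re-aligns the swapped outputs, giving zero loss
```

Because the loss is taken over the best permutation, the network is never penalized merely for emitting speakers in a different order than the labels.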
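The time-domain encoder-decoder structure described above replaces the STFT with a learned analysis/synthesis basis: the mixture waveform is split into overlapping frames, projected onto encoder filters, masked per speaker, and reconstructed by overlap-add. A minimal NumPy sketch; the window length, hop, and number of basis filters are illustrative assumptions, and the random matrices stand in for parameters that would be learned in training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 16-sample windows with 50% overlap, 64 basis filters.
win, hop, n_basis = 16, 8, 64

# Stand-ins for learned parameters.
encoder = rng.standard_normal((n_basis, win))   # analysis basis
decoder = rng.standard_normal((win, n_basis))   # synthesis basis

def encode(x):
    """Frame the waveform and project each frame onto the encoder
    basis (a 1-D convolution written in matrix form)."""
    frames = np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
    return frames @ encoder.T                    # (n_frames, n_basis)

def decode(feats):
    """Overlap-add reconstruction from (masked) features."""
    frames = feats @ decoder.T                   # (n_frames, win)
    out = np.zeros(hop * (len(frames) - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

mixture = rng.standard_normal(128)
feats = encode(mixture)
# A separation network would predict one mask per speaker; here a
# random sigmoid mask stands in for that prediction.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(feats.shape)))
separated = decode(mask * feats)
print(separated.shape)  # → (128,), same length as the input mixture
```

Because the basis is learned end-to-end rather than fixed like the STFT, the model can operate on short windows entirely in the time domain, which is what keeps the audio stream's computational cost low.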
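One common way to fuse the two modalities before the separation network, consistent with the fusion step described above, is to upsample the lip-motion feature sequence to the audio frame rate and concatenate the streams along the feature axis. A sketch under assumed rates and sizes (25 video frames and 100 audio frames per second, 64-dim audio and 32-dim visual features are all illustrative):

```python
import numpy as np

# One second of features at assumed frame rates.
audio_feats = np.random.default_rng(1).standard_normal((100, 64))   # 100 audio frames
visual_feats = np.random.default_rng(2).standard_normal((25, 32))   # 25 lip-motion frames

# Upsample the visual stream to the audio frame rate by repetition,
# then concatenate the modalities along the feature dimension.
upsampled = np.repeat(visual_feats, 100 // 25, axis=0)    # (100, 32)
fused = np.concatenate([audio_feats, upsampled], axis=1)  # (100, 96)
print(fused.shape)  # → (100, 96)
```

The fused sequence can then be fed to a recurrent or convolutional separation network, which predicts one mask per target speaker conditioned on that speaker's visual stream.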
Keywords/Search Tags:Multi-person dialogue system, Time domain speech separation, Neural network, Audio visual fusion, Feature extraction