
Research On Audio Visual Fusion Speech Separation Method For Multi-person Dialogue Robot

Posted on: 2022-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: D B Liu
Full Text: PDF
GTID: 2518306491492144
Subject: Control Engineering
Abstract/Summary:
In multi-person human-robot speech interaction, separating the speech of multiple target speakers from the mixture is a prerequisite for speech recognition and dialogue. Speech separation, also known as the "cocktail party" problem, is the task of recovering each speaker's speech from a mixed signal containing multiple sources. While human listeners can easily extract a target speaker's speech, the problem has remained a major challenge for machines over the past few decades. Single-channel speech separation is especially difficult: the number of microphones is smaller than the number of speakers and the speakers' signals overlap, so traditional separation methods rely on strong conditional assumptions and hand-designed rules, and their separation performance has hit a bottleneck.

With the advent of data-driven speech separation methods based on deep clustering and permutation invariant training, the separation process has been recast as a deep learning problem, resolving the label-permutation ambiguity and greatly improving separation quality. Audio-only methods, however, fail to exploit the speaker's visual information, and stable multi-speaker separation is difficult to achieve even with a microphone array. Speech separation methods based on audio-visual multimodal fusion use the speaker's lip-motion information to assist the separation process and can ensure separation performance, but existing time-frequency-domain audio-visual methods suffer from large model size and high computational complexity in the audio stream. To address these problems, this thesis proposes a single-channel speech separation method based on time-domain audio-visual fusion, in which the auditory and visual modalities of the separation system are jointly modeled with a time-domain encoder-decoder structure, improving separation performance.

The main research content of this thesis includes:

1) A speech separation method based on time-domain convolution. In the time-domain audio-only method, a neural network extracts a feature sequence from the speech signal, and a single-modality separation model is built to separate the mixed speech. The structural parameters of the time-domain model are then optimized and validated through comparative experiments.

2) A multimodal speech separation method based on time-domain audio-visual fusion. This thesis studies how to extract a better visual feature sequence for the target speakers and uses a recurrent neural network to fuse the speakers' visual and audio feature sequences. A better separation network is also studied to improve separation accuracy. The effectiveness of the feature extraction and separation models is verified through comparative experiments on public datasets.

3) Design and implementation of a speech separation system. A software framework comprising data preprocessing, speech separation, and other modules is designed, and a complete speech separation software system is implemented that collects image and speech signals and separates speech using the algorithm proposed in this thesis.

In summary, this thesis studies a single-channel speech separation method based on time-domain audio-visual fusion and conducts comparative experiments with audio-only and audio-visual separation models on audio-visual datasets. The experimental results show that the proposed time-domain audio-visual fusion method improves speech separation performance, which is of practical significance for deploying dialogue robots in multi-party conversation scenarios.
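Permutation invariant training, mentioned above, resolves the label-permutation ambiguity by evaluating the training loss under every assignment of model outputs to reference speakers and keeping the minimum. A minimal NumPy sketch (the function name `pit_mse` and the use of MSE as the criterion are illustrative assumptions; real systems often use SI-SNR instead):

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Utterance-level permutation invariant training loss: compute the
    MSE for every speaker permutation and return the minimum, together
    with the minimizing permutation."""
    n = len(estimates)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                        for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two toy reference signals; the "estimates" come back in swapped order.
targets = np.random.RandomState(0).randn(2, 100)
estimates = targets[::-1]
loss, perm = pit_mse(estimates, targets)
print(perm)  # → (1, 0): PIT re-aligns the swapped outputs, giving zero loss
```

Because the loss is taken over the best permutation, the network is never penalized merely for emitting speakers in a different order than the labels.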
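The time-domain encoder-decoder structure described above replaces the STFT with a learned analysis/synthesis basis: the mixture waveform is split into overlapping frames, projected onto encoder filters, masked per speaker, and reconstructed by overlap-add. A minimal NumPy sketch; the window length, hop, and number of basis filters are illustrative assumptions, and the random matrices stand in for parameters that would be learned in training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 16-sample windows with 50% overlap, 64 basis filters.
win, hop, n_basis = 16, 8, 64

# Stand-ins for learned parameters.
encoder = rng.standard_normal((n_basis, win))   # analysis basis
decoder = rng.standard_normal((win, n_basis))   # synthesis basis

def encode(x):
    """Frame the waveform and project each frame onto the encoder
    basis (a 1-D convolution written in matrix form)."""
    frames = np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
    return frames @ encoder.T                    # (n_frames, n_basis)

def decode(feats):
    """Overlap-add reconstruction from (masked) features."""
    frames = feats @ decoder.T                   # (n_frames, win)
    out = np.zeros(hop * (len(frames) - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

mixture = rng.standard_normal(128)
feats = encode(mixture)
# A separation network would predict one mask per speaker; here a
# random sigmoid mask stands in for that prediction.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(feats.shape)))
separated = decode(mask * feats)
print(separated.shape)  # → (128,), same length as the input mixture
```

Because the basis is learned end-to-end rather than fixed like the STFT, the model can operate on short windows entirely in the time domain, which is what keeps the audio stream's computational cost low.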
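One common way to fuse the two modalities before the separation network, consistent with the fusion step described above, is to upsample the lip-motion feature sequence to the audio frame rate and concatenate the streams along the feature axis. A sketch under assumed rates and sizes (25 video frames and 100 audio frames per second, 64-dim audio and 32-dim visual features are all illustrative):

```python
import numpy as np

# One second of features at assumed frame rates.
audio_feats = np.random.default_rng(1).standard_normal((100, 64))   # 100 audio frames
visual_feats = np.random.default_rng(2).standard_normal((25, 32))   # 25 lip-motion frames

# Upsample the visual stream to the audio frame rate by repetition,
# then concatenate the modalities along the feature dimension.
upsampled = np.repeat(visual_feats, 100 // 25, axis=0)    # (100, 32)
fused = np.concatenate([audio_feats, upsampled], axis=1)  # (100, 96)
print(fused.shape)  # → (100, 96)
```

The fused sequence can then be fed to a recurrent or convolutional separation network, which predicts one mask per target speaker conditioned on that speaker's visual stream.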
Keywords/Search Tags:Multi-person dialogue system, Time domain speech separation, Neural network, Audio visual fusion, Feature extraction