
Research And Implementation Of Multi-speaker Speech Separation Based On Deep Learning

Posted on: 2022-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: T T Li
Full Text: PDF
GTID: 2518306740983309
Subject: Software engineering

Abstract/Summary:
The speech separation problem refers to the task in which an agent or a robot separates different categories of sound from a signal containing multiple voices and noise, so that it can focus on a specific speaker in a noisy environment. Although speech separation has been studied for many years, satisfactory results in noisy multi-speaker environments remain out of reach. Against the backdrop of epidemic normalization, many applications such as teleconferencing and video communication face the problem of noisy speech, and speech separation has gradually attracted the attention of scholars. This thesis therefore takes single-channel multi-speaker speech as its research object. Drawing on existing speech separation algorithms and focusing on the natural consistency of audio-visual information, the separation task is first modeled in the time-frequency domain and the neural network model is improved; then, to increase the speed and separation quality of the model, time-frequency processing is replaced by time-domain processing and a time-domain speech separation network is proposed. The main contributions are as follows:

(a) A Chinese general-purpose audio-visual speech dataset named CUAVS is presented. It comprises audio and visual modalities, part of the audio is noisy, and the spoken content is Mandarin. To build the dataset, an automatic video filtering system is designed that accepts raw, noisy, irregular video and outputs fully aligned audio-visual speech and image frames.

(b) A multi-modal speech separation network called ResUNet-P is proposed, which aims to improve separation performance in noisy environments and to keep the model robust when the visual stream is missing. ResUNet-P consists of three modules: feature extraction, fusion, and upsampling. The feature extraction module adds residual connections to learn features of different dimensions, and the fusion module uses the Pearson correlation coefficient to fully fuse the audio-visual features (a fusion sketch follows the abstract). ResUNet-P also adapts well to separation in the absence of vision. On speech quality evaluation metrics, ResUNet-P is 33.3% better than audio-only separation and 4.2% better than U-Net.

(c) Because the time-frequency transform of speech is time-consuming and lossy, a time-domain speech separation network based on the dual-path recurrent neural network (DPRNN) is presented to improve both the speed and the performance of separation. A one-dimensional convolution replaces the time-frequency transform for encoding the audio (sketched below), visual features are embedded into the DPRNN, and the Pearson correlation coefficient again fuses the audio-visual features. In addition, a solution for multi-speaker separation with an unknown number of speakers is designed, in which the average energy of the separated channels selects the model trained for the corresponding number of speakers (routing logic sketched below). In multi-speaker scenarios the time-domain network performs significantly better than ResUNet-P; for example, the separation of a two-speaker mixture improves by 37.7%.

Finally, ResUNet-P and the improved DPRNN are evaluated on CUAVS and compared with the classical audio-visual datasets GRID and TCD-TIMIT. The results show that the time-domain speech separation model outperforms the time-frequency model by nearly 40% and is better suited to multi-speaker speech separation.
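The abstract does not give the fusion equations, so the following is a minimal PyTorch sketch of one way a Pearson correlation coefficient can drive audio-visual fusion: channels whose visual stream tracks the audio receive a larger visual contribution. The tensor shapes, the non-negative gating, and all function names are illustrative assumptions, not the thesis's implementation.

```python
import torch

def pearson_corr(a: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-channel Pearson correlation between time-aligned feature maps.

    a, v: (batch, channels, time) audio and visual features.
    Returns: (batch, channels) coefficients in [-1, 1].
    """
    a = a - a.mean(dim=-1, keepdim=True)
    v = v - v.mean(dim=-1, keepdim=True)
    cov = (a * v).mean(dim=-1)
    std = a.std(dim=-1, unbiased=False) * v.std(dim=-1, unbiased=False)
    return cov / (std + eps)

def fuse(audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
    # Correlation-gated additive fusion (hypothetical scheme): anti-correlated
    # channels are ignored by clamping the gate at zero.
    gate = pearson_corr(audio_feat, visual_feat).clamp(min=0.0).unsqueeze(-1)
    return audio_feat + gate * visual_feat
```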
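The one-dimensional convolutional encoder that replaces the time-frequency transform follows the widely used Conv-TasNet/DPRNN pattern: a learned analysis filterbank in place of the STFT. A minimal sketch, with typical kernel size, stride, and filter count assumed rather than the thesis's actual hyperparameters:

```python
import torch
import torch.nn as nn

class TimeDomainCodec(nn.Module):
    """Learned 1-D conv encoder/decoder pair standing in for the STFT/iSTFT.

    Hyperparameters are common Conv-TasNet/DPRNN defaults, not the thesis's.
    """
    def __init__(self, n_filters: int = 64, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> latent features: (batch, n_filters, frames)
        feats = torch.relu(self.encoder(wav.unsqueeze(1)))
        # The separation network (DPRNN blocks with embedded visual features)
        # would predict per-speaker masks over `feats` here; this sketch only
        # shows the encode/decode round trip that replaces the STFT.
        return self.decoder(feats).squeeze(1)
```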
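For the unknown-speaker-count scenario, the abstract states only that the average energy of the separated channels is used to match the model for the corresponding number of speakers. The routing below is one hypothetical reading: run the model trained for the maximum speaker count first, count the channels with non-negligible energy, then dispatch to the matching model. `models_by_count`, `MAX_SPK`, and the threshold are placeholders, not values from the thesis.

```python
import torch

def estimate_num_speakers(mixture: torch.Tensor, max_model, thresh: float = 1e-3) -> int:
    # Separate with the widest model, giving e.g. (num_spk_max, samples),
    # then count channels whose average energy exceeds the threshold.
    with torch.no_grad():
        channels = max_model(mixture)
    energy = channels.pow(2).mean(dim=-1)  # average energy per channel
    return int((energy > thresh).sum().item())

# Dispatch to the separation model trained for the estimated speaker count:
# n = estimate_num_speakers(mixture, models_by_count[MAX_SPK])
# sources = models_by_count[n](mixture)
```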
Keywords/Search Tags:speech separation, multi-modal fusion, spectrogram, dataset, neural network