
Research On Long Time Sequence Speech Enhancement Based On Multi-task Learning

Posted on: 2023-05-30    Degree: Master    Type: Thesis
Country: China    Candidate: J G Ren    Full Text: PDF
GTID: 2568306776475714    Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of the Internet and computer technology, the demand for intelligent human-computer interaction has been growing. Speech is not only an important way for human society to exchange information, but also a key interface for human-computer interaction, and it plays a central role in daily life. As the core of speech interaction, speech recognition and related technologies have attracted extensive attention, and speech enhancement, as the front-end processing stage of speech recognition, has become a research hotspot in recent years.

Mainstream single-channel speech enhancement models rely on long short-term memory (LSTM) networks for temporal modeling, but their modeling capacity is limited: they cannot effectively capture the temporal and global contextual correlations of long-duration speech signals, and the encoder-decoder uses a single convolution kernel size, which limits how efficiently high-dimensional features can be extracted and restored. In addition, the features that a single speech enhancement task can learn are limited, so generalization to unknown signal-to-noise ratios and unseen speakers is weak. To address these difficulties and challenges, this thesis studies long-time-sequence speech enhancement based on multi-task learning. The main work is as follows:

(1) A long-time-sequence speech enhancement method based on complex temporal convolution and self-attention is proposed. The method takes the complex arithmetic rules governing the real and imaginary parts of the Fourier transform as important prior information and designs a one-dimensional temporal convolution module that follows these rules; a large local receptive field along the time dimension is obtained by stacking complex one-dimensional dilated convolutions whose receptive fields grow from small to large (a minimal sketch of such a module is given below). At the same time, a multi-head complex self-attention module is designed to model the global contextual correlation of the features along the time dimension. Compared with LSTM, this method models long sequences and global context more effectively. A Selective Kernel Convolutional Encoder-Decoder structure is also designed: in the encoding and decoding stages, two convolution kernels of different scales extract dynamic multi-scale local features on each channel, improving the codec's ability to extract and restore features. Experimental results show that, compared with existing methods, the proposed model DCSKTSN achieves gains on objective metrics such as PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility). On long speech segments of more than 4 seconds from the TIMIT dataset, PESQ and STOI improve by 0.1% and 2.27% respectively over LSTM; on the VBD (Voice Bank + DEMAND) dataset, PESQ, STOI, CSIG and COVL improve by 0.18, 0.52, 0.01 and 0.1 respectively over DCCRN.
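As an illustration only (not the thesis code), the following PyTorch sketch shows how a complex one-dimensional dilated convolution block of this kind could be organized: real and imaginary parts are convolved according to the complex product rule (a + jb)(c + jd) = (ac - bd) + j(ad + bc), and layers with growing dilation enlarge the temporal receptive field. The module names, channel sizes and dilation schedule (ComplexDilatedConv1d, ComplexTCNBlock, dilations=(1, 2, 4, 8)) are assumptions for illustration:

import torch
import torch.nn as nn


class ComplexDilatedConv1d(nn.Module):
    """One complex 1-D convolution over the time axis with a given dilation."""

    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # keep the frame count unchanged
        self.conv_r = nn.Conv1d(channels, channels, kernel_size,
                                dilation=dilation, padding=padding)
        self.conv_i = nn.Conv1d(channels, channels, kernel_size,
                                dilation=dilation, padding=padding)

    def forward(self, x_r, x_i):
        # complex multiplication rule applied to the convolution
        y_r = self.conv_r(x_r) - self.conv_i(x_i)
        y_i = self.conv_r(x_i) + self.conv_i(x_r)
        return y_r, y_i


class ComplexTCNBlock(nn.Module):
    """Stack of complex dilated convolutions with dilation growing 1, 2, 4, ...
    so the temporal receptive field expands from small to large."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(
            [ComplexDilatedConv1d(channels, kernel_size, d) for d in dilations])
        self.act = nn.PReLU()

    def forward(self, x_r, x_i):
        for layer in self.layers:
            r, i = layer(x_r, x_i)
            x_r, x_i = x_r + self.act(r), x_i + self.act(i)  # residual connections
        return x_r, x_i


if __name__ == "__main__":
    # batch of 2 utterances, 64 feature channels, 400 STFT frames
    x_r, x_i = torch.randn(2, 64, 400), torch.randn(2, 64, 400)
    block = ComplexTCNBlock(64)
    y_r, y_i = block(x_r, x_i)
    print(y_r.shape, y_i.shape)  # torch.Size([2, 64, 400]) each

With kernel size 3 and dilations 1, 2, 4 and 8, the stacked receptive field already spans about 31 frames, which is how linear superposition of small dilated kernels yields long-range temporal coverage.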
(2) A speech enhancement method based on multi-task learning is proposed. This method introduces the idea of multi-task learning and provides additional auxiliary information for the enhancement task by adding two auxiliary tasks: signal-to-noise-ratio prediction and speaker classification. This auxiliary information carries features that a single speech enhancement task cannot learn on its own, giving the enhancement network better generalization and adaptability to unknown signal-to-noise ratios and unseen speakers (a sketch of such a combined training objective is given after this summary). Ablation experiments based on Conv-LSTM verify the effectiveness of the method; after it is combined with DCSKTSN, the final model improves PESQ, CSIG and COVL by 0.77, 0.49 and 0.64 respectively over existing methods.

(3) A prototype system for long-time-sequence speech enhancement based on multi-task learning is designed and implemented. The user interface is built with MATLAB, and the core algorithm is implemented with the PyTorch deep learning framework and the Python programming language. The system includes modules for speech dataset upload, enhancement model training, and enhanced speech waveform display and playback. The speech enhancement model adopts the long-time-sequence, multi-task-learning method proposed in this thesis, and the implementation of the prototype system verifies the effectiveness and practicality of the proposed method.
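As an illustration only (not the thesis implementation), the following PyTorch sketch shows one way such a multi-task objective could be assembled: the enhancement loss is combined with an auxiliary SNR-regression loss and an auxiliary speaker-classification loss computed from shared utterance-level features. The head structure, pooling, loss weights and feature shapes (MultiTaskHeads, w_snr, w_spk, feat_dim) are assumptions for illustration:

import torch
import torch.nn as nn


class MultiTaskHeads(nn.Module):
    """Auxiliary heads and weighted sum of the three task losses."""

    def __init__(self, feat_dim, num_speakers, w_snr=0.1, w_spk=0.1):
        super().__init__()
        self.snr_head = nn.Linear(feat_dim, 1)             # regress the global SNR (dB)
        self.spk_head = nn.Linear(feat_dim, num_speakers)  # classify the speaker
        self.w_snr, self.w_spk = w_snr, w_spk
        self.mse, self.ce = nn.MSELoss(), nn.CrossEntropyLoss()

    def forward(self, enh_loss, shared_feat, snr_target, spk_target):
        # shared_feat: (batch, time, feat_dim) features from the shared encoder
        pooled = shared_feat.mean(dim=1)                    # utterance-level pooling
        snr_loss = self.mse(self.snr_head(pooled).squeeze(-1), snr_target)
        spk_loss = self.ce(self.spk_head(pooled), spk_target)
        return enh_loss + self.w_snr * snr_loss + self.w_spk * spk_loss


if __name__ == "__main__":
    heads = MultiTaskHeads(feat_dim=128, num_speakers=10)
    feat = torch.randn(4, 400, 128)                         # 4 utterances, 400 frames
    enh_loss = torch.tensor(0.25)                           # e.g. an MSE or SI-SNR enhancement term
    snr_target = torch.tensor([0.0, 5.0, 10.0, -5.0])
    spk_target = torch.tensor([1, 3, 7, 2])
    total = heads(enh_loss, feat, snr_target, spk_target)
    print(total.item())

The auxiliary heads are only used during training to shape the shared representation; at inference time the enhancement branch alone produces the denoised speech.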
Keywords/Search Tags:speech enhancement, multi-task learning, complex number operation, self-attention mechanism, temporal convolutional network