
Research On Single-channel Speech Separation Based On Deep Learning

Posted on: 2022-06-29    Degree: Master    Type: Thesis
Country: China    Candidate: J J Chen    Full Text: PDF
GTID: 2518306506463724    Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of Internet and computer technology, intelligent interaction has gradually begun to change people's lifestyles and has affected many aspects of daily life. As one of the most important, common, and convenient methods of information interaction, speech plays a very important role in people's daily lives, and it is one of the hot topics in intelligent interaction research. Speech interaction technologies such as automatic speech recognition and speech synthesis have received extensive attention from researchers in recent years and have been applied successfully in several fields. However, some problems remain to be solved urgently; for example, multi-talker mixtures usually degrade the recognition performance of automatic speech recognition systems. Therefore, studying how to extract a clean speech signal for each speaker from mixed speech has become pressing for the development of speech technology. In the field of speech separation, besides multi-channel techniques that exploit spatial information, single-channel speech separation is also an active research topic because single-channel speech acquisition is convenient and inexpensive.

Separating single-channel mixed speech involves several difficulties and challenges. First, because spatial information is unavailable, the speech features of all talkers are entangled in a single mixture and are hard to separate. Second, speech is a temporal signal with context, and modeling extremely long speech sequences is an important issue. Third, the separated sources usually contain artifacts (broadband noise), which harm perceptual quality and intelligibility, especially when the mixture itself is noisy. To address these three challenges, we propose corresponding solutions based on deep learning. The main contents and innovations of this thesis are as follows:

(1) We propose a single-channel speech separation method using a weighted-generative-factors autoencoder. An autoencoder extracts the generative factors of the mixed speech, and an attention mechanism weights these factors to learn speaker-specific generative factors, which are then used to construct independent speech features for each speaker and thereby achieve separation. In addition, the method introduces a regularization loss into the objective function to strengthen the separation effect and improve separation performance. Experimental results verify the effectiveness of the method: it is significantly superior to existing related methods, with large improvements in the SDR (source-to-distortion ratio), SIR (source-to-interference ratio), and SAR (source-to-artifacts ratio) metrics. On the TIMIT corpus, the SDR, SIR, and SAR of the proposed model improve by 3.57 dB (decibels), 5.92 dB, and 4.53 dB, respectively, compared with existing related methods.

(2) We propose a single-channel speech separation method based on a dual-path transformer network. The model introduces the self-attention mechanism and the transformer network to realize direct information interaction between all elements of the speech signal, thereby achieving direct context-aware modeling. In addition, we introduce a dual-path network to perform global modeling of extremely long speech features, which maximizes the receptive field and fully captures beneficial contextual information; a minimal sketch of the dual-path idea is given below.
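The following is a minimal, illustrative PyTorch sketch of dual-path processing (hypothetical layer sizes and chunk length; it is not the implementation used in the thesis): the long feature sequence is split into fixed-length chunks, an intra-chunk transformer models local context within each chunk, and an inter-chunk transformer models the global context across chunks.

    # Illustrative sketch only: hypothetical dimensions, not the thesis model.
    import torch
    import torch.nn as nn

    class DualPathBlock(nn.Module):
        def __init__(self, feat_dim=64, n_heads=4):
            super().__init__()
            # Intra-chunk transformer models local context within each chunk;
            # inter-chunk transformer models global context across chunks.
            self.intra = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
            self.inter = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)

        def forward(self, x):                      # x: [batch, chunks, chunk_len, feat]
            b, s, k, n = x.shape
            y = self.intra(x.reshape(b * s, k, n)).reshape(b, s, k, n)
            y = y.transpose(1, 2)                  # [batch, chunk_len, chunks, feat]
            y = self.inter(y.reshape(b * k, s, n)).reshape(b, k, s, n)
            return y.transpose(1, 2)               # back to [batch, chunks, chunk_len, feat]

    # Segment a long feature sequence into fixed-length chunks, then apply the block.
    feats = torch.randn(2, 1000, 64)               # [batch, time, feat], e.g. encoder output
    chunk = 100
    segments = feats.reshape(2, -1, chunk, 64)     # [batch, chunks, chunk_len, feat]
    out = DualPathBlock()(segments)
    print(out.shape)                               # torch.Size([2, 10, 100, 64])

Because the intra-chunk path attends only within a chunk and the inter-chunk path attends across chunk positions, every element can influence every other element after one block while each attention operation stays short, which is what makes extremely long sequences tractable.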
The experimental results show that the single-channel speech separation method based on the dual-path transformer network improves speech separation performance significantly. On the publicly available WSJ0-2mix corpus, our method outperforms the most advanced methods: the SI-SNR (scale-invariant source-to-noise ratio) metric increases by 7.4% to 20.2 dB, and the SDR metric increases by 8.4% to 20.6 dB. On the LS-2mix corpus, our method also obtains a significant performance improvement.

(3) We propose a single-channel noisy speech separation method based on mapping learning. To address the problem that background noise in the mixed speech may mask the clean speech signal, we use a mapping approach to learn a feature representation for each speaker and to recover the masked speech signal. The proposed method reduces artifacts (broadband noise) in principle and improves the perceptual quality of the separated speech (a sketch contrasting mask-based and mapping-based output heads is given after item (4)). Experiments on clean and noisy mixtures show that mapping learning can recover the speech signal without extra computational overhead and deals effectively with background noise during separation. Built on the Conv-TasNet model for speech separation, the mapping learning method improves the SDR, SIR, and SAR metrics by 0.53 dB, 2.01 dB, and 0.54 dB, respectively.

(4) This thesis uses programming languages and deep learning frameworks to design and implement a prototype system; the languages include Matlab and Python, and the frameworks include TensorFlow, Keras, and PyTorch. The single-channel speech separation prototype system contains three modules: a module for uploading the mixture, a single-channel speech separation module, and a module for playing and visualizing the separated speech.
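As referenced in item (3), the following is a minimal, illustrative PyTorch sketch (hypothetical dimensions; not the thesis implementation) contrasting a mask-based separation head, which rescales the mixture features and therefore lets residual noise leak through, with a mapping-based head, which regresses each speaker's features directly from the mixture and can in principle recover components buried under background noise.

    # Illustrative sketch only: hypothetical dimensions, not the thesis model.
    import torch
    import torch.nn as nn

    feat_dim, n_spk = 128, 2
    mix_feat = torch.randn(4, 500, feat_dim)           # mixture features: [batch, time, feat]

    # Mask-based head: estimate a bounded mask per speaker and multiply it
    # with the mixture features, so noise present in the mixture leaks through.
    mask_head = nn.Sequential(nn.Linear(feat_dim, feat_dim * n_spk), nn.Sigmoid())
    masks = mask_head(mix_feat).reshape(4, 500, n_spk, feat_dim)
    masked = masks * mix_feat.unsqueeze(2)              # [batch, time, n_spk, feat]

    # Mapping-based head: regress each speaker's features directly from the
    # mixture, which lets the network reconstruct components masked by noise.
    map_head = nn.Linear(feat_dim, feat_dim * n_spk)
    mapped = map_head(mix_feat).reshape(4, 500, n_spk, feat_dim)

    print(masked.shape, mapped.shape)                   # both: [4, 500, 2, 128]

Both heads have the same output shape and add no extra inference cost relative to one another; the difference lies only in whether the target is a multiplicative mask applied to the mixture or the speaker's features themselves.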
Keywords/Search Tags: single-channel speech separation, deep learning, generative factors, dual-path transformer network, direct context-aware modeling, global modeling, mapping learning