Research on Auto-Regressive Deep Neural Network Based Monaural Speech Separation

Posted on: 2020-04-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C X Li
Full Text: PDF
GTID: 1368330572478894
Subject: Signal and Information Processing
Abstract/Summary:
Speech separation remains one of the most challenging problems in speech signal processing. Now that computer and Internet technologies are highly developed, speech separation plays an important role in speech communication and human-machine speech interfaces, and it directly affects user experience in complex acoustic environments. Researchers have studied monaural speech separation since the 1950s. Before deep learning emerged, the traditional approaches included signal processing and statistical modeling based methods, computational auditory scene analysis, non-negative matrix factorization, and hidden Markov models. The separation performance of these methods is limited, however, because they rely on unreasonable assumptions or manually designed heuristic rules. Deep learning based approaches need no such assumptions or rules. Instead, they use powerful neural networks and sufficient training data to learn the complex dependencies between the mixture and the target speech, and so separate better than the traditional approaches. Recently, deep clustering and permutation invariant training have also solved the label permutation problem, which is very difficult for conventional deep learning based approaches.

Although deep learning based methods separate significantly better than traditional methods, they still have disadvantages. First, the network structures in use usually do not fully exploit the temporal context and the dependencies between the mixture, the target speech and the interference signal, and their temporal memory is limited. Second, the commonly adopted training criterion, the minimum mean square error criterion, causes spectral over-smoothing. Third, the latest deep clustering and permutation invariant training methods usually reach their best separation performance with non-causal network structures at the expense of a large time delay, so they cannot be applied to online separation scenarios. Moreover, the best causal network structure still shows a significant performance gap compared with non-causal structures.

To address these shortcomings, this dissertation builds on conventional deep learning based monaural speech separation and studies new solutions. It focuses on two sub-tasks of speech separation, speech enhancement and speaker-independent multi-speaker speech separation, and proposes an approach based on auto-regressive deep neural networks.

Firstly, in speech enhancement, conventional regression deep neural network based methods do not fully use the temporal context and the dependencies between the mixture and the target speech, and the minimum mean square error criterion causes spectral over-smoothing. To address both problems, this dissertation proposes an auto-regressive neural network based speech enhancement method. The proposed network models the relationship between the signals effectively, and its training scheme combines adversarial training with the proposed multi-time-step prediction training. The scheme alleviates the mismatch between the training stage and the enhancement stage, improves speech enhancement performance, and alleviates spectral over-smoothing.

Secondly, in speaker-independent multi-speaker speech separation, regression deep neural network based methods encounter the label permutation problem, and the latest deep clustering and permutation invariant training methods are limited in online separation scenarios. Drawing on research into the human auditory perception mechanism and computational auditory scene analysis, this dissertation proposes an auto-regressive neural network based speaker-independent multi-speaker speech separation method. With the proposed listening stage and grouping stage networks, the method fully uses the temporal context and the dependencies between the mixture and all source signals, addresses the label permutation problem in a new way, and achieves state-of-the-art separation performance among online methods.

Finally, this dissertation extends and improves the methods above. Further analysis of speech separation identifies two drawbacks: conventional short-time Fourier transform based approaches use magnitude information but do not fully use phase information, and most recent methods ignore long-term speaker information. To address them, this dissertation proposes waveform domain end-to-end modeling with waveform sparse coding, together with speaker information assisted training, so that the phase information in the waveform is fully used while speaker information is extracted and memorized. With the further improved network structure, the method achieves better separation performance than the methods described above.
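
The label permutation problem mentioned in the abstract can be illustrated with a small sketch. The code below is not the dissertation's method; it is a generic, utterance-level permutation invariant training (PIT) loss with a mean squared error criterion, written in plain Python. The function names `mse` and `pit_mse` and the toy signals are assumptions made only for this illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_mse(estimates, targets):
    """Return (lowest mean MSE, best permutation) over all assignments.

    best_perm[i] is the index of the target assigned to estimates[i].
    Hypothetical helper for illustration; real systems compute this on
    spectrogram or waveform tensors inside the training loop.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-source error under this output-to-target assignment.
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the network emits the two speakers in swapped order.
targets = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
estimates = [[0.1, 1.9, 0.0], [0.9, 0.0, 1.1]]

loss, perm = pit_mse(estimates, targets)

# A fixed output-to-target assignment penalizes the swap heavily.
naive = (mse(estimates[0], targets[0]) + mse(estimates[1], targets[1])) / 2

print(perm, loss, naive)
```

With a fixed assignment, good estimates in the wrong output order produce a large loss, which is the label permutation problem. Searching over assignments removes that penalty, at a cost that grows factorially with the number of speakers.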
Keywords/Search Tags: Speech separation, Speech enhancement, Multi-speaker speech separation, Auto-regressive deep neural networks, Adversarial training, Waveform domain modeling