
Multi-speaker Speech Separation Based On Deep Learning

Posted on: 2022-01-26    Degree: Master    Type: Thesis
Country: China    Candidate: C L Wang
GTID: 2518306605468724    Subject: Master of Engineering
Abstract/Summary:
With the rapid development of artificial intelligence and Internet technology, speech interaction scenarios appear more and more often in daily life. Speech separation is one of the most challenging research tasks in speech signal processing, and it plays an important role in practical, complex acoustic scenes. Since the introduction of deep clustering and permutation invariant training, the label permutation problem has been solved, neural networks have been applied more widely to speech separation, and the performance of speaker-independent multi-speaker separation has improved. Time-domain speech separation networks were proposed to avoid the phase mismatch that arises in the waveform-reconstruction step of frequency-domain separation and to learn a feature representation better suited to separation. Although these deep-learning-based methods have developed rapidly and continue to make breakthroughs, two problems remain. First, most existing work addresses single-channel separation on clean mixed-speech datasets, and performance drops significantly when noise is present. Second, although current single-channel methods can model speech context to some extent, they lack the ability to correlate global features and do not make full use of contextual information and long-range dependencies. To address these problems, this thesis works on a noisy mixed-speech dataset and proposes a single-channel speech separation network that combines a temporal convolutional network with an attention mechanism. The proposed network improves separation performance on noisy data, with a particularly clear gain at low signal-to-noise ratios.

With improvements in microphone hardware and rising requirements on speech quality, microphone arrays are being deployed in more and more scenarios, and exploiting the spatial information they capture to improve speech separation is of significant research value. Most existing deep-learning-based multi-channel separation methods simply extend single-channel systems: the spatial information collected by the microphone array is concatenated with the speech waveform information as the input of the separation network. Researchers have proposed methods that refine this spatial information to improve multi-channel separation, but problems remain. First, when the angular difference between sound sources is small, the spatial information becomes aliased. Second, only the waveform of the reference channel is used in the waveform-reconstruction step, so the spatial information is not fully exploited. To address these problems, this thesis proposes a multi-channel speech separation framework in a spatial-temporal feature domain, in which both time-domain waveform information and spatial information are used in the waveform-reconstruction step. Two speech spatial-temporal feature encoders are designed on top of this framework. Compared with the latest multi-channel speech separation methods, the proposed approach achieves a significantly improved performance.
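The abstract credits permutation invariant training (PIT) with resolving the label permutation problem in speaker-independent separation. The sketch below is a minimal, generic illustration of utterance-level PIT with a scale-invariant SNR objective; it assumes a PyTorch setup, and the function names and tensor layout are illustrative rather than taken from the thesis.

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR between estimated and reference waveforms, last dim = time."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get its target component.
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref * ref, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def pit_si_snr_loss(est_sources, ref_sources):
    """Utterance-level permutation invariant training loss.

    est_sources, ref_sources: tensors of shape (batch, num_speakers, time).
    Every speaker permutation is scored and the best one is kept per utterance,
    which removes the ambiguity of assigning network outputs to speaker labels.
    """
    _, n_src, _ = est_sources.shape
    losses = []
    for perm in itertools.permutations(range(n_src)):
        perm_est = est_sources[:, list(perm), :]
        # Negative mean SI-SNR over speakers gives a per-utterance loss to minimise.
        losses.append(-si_snr(perm_est, ref_sources).mean(dim=-1))
    losses = torch.stack(losses, dim=1)      # (batch, num_permutations)
    best, _ = torch.min(losses, dim=1)       # best permutation per utterance
    return best.mean()
```

In practice the same permutation-invariant objective can be attached to either the single-channel TCN-plus-attention network or a multi-channel separator; only the shape of the network inputs changes, not the loss.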
Keywords/Search Tags: Speech separation, attention mechanism, multi-channel speech separation, speech spatial-temporal coding