
Research on Time-Domain Blind Source Speech Separation Based on Convolutional Neural Networks

Posted on: 2022-02-08  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Sun  Full Text: PDF
GTID: 2518306320975499  Subject: Computer system architecture
Abstract/Summary:
Speech is one of the most important media in human communication and has long been a focus of both academia and industry. In a speech signal containing multiple simultaneous speakers, recovering the clean speech of each target speaker is an extremely important research direction in the speech field. This multi-target separation task is usually formulated as a blind source separation problem: the system must estimate the source signals from the mixed signal without prior knowledge of the sources or the transmission channel.

The common approach to speech separation operates on a frequency-domain representation, which suffers from several problems: the decoupling of signal phase and magnitude, a suboptimal representation for separation, and the high latency of spectrum computation. To address these problems, this thesis re-models the long-term dependence of the speech signal input in the time-domain convolution operations of the original fully convolutional network.

First, to compensate for the loss of valid data caused by zero padding across iterations, the new temporal convolution block introduces a mask layer into the feature mapping. This more efficient feature-extraction network alleviates the time-space redundancy of the original design. To improve separation quality, the original system overlaps adjacent segments of the sliced speech data by half, trading the high time cost of repeated computation for better separation. In the hidden output layers of the new network, effective data replaces the zero padding to increase the convolution participation rate of the underlying data, and the overlap between adjacent segments is reduced to one third. With one-third overlap between adjacent segments, the total input data is reduced by 25%, and the separation
module with complementary padding ensures that network training time remains essentially unchanged while the separated target speech performs better.

Second, exploiting the linear uncorrelation between the separated speech codes of different speakers, the system subtracts the estimated target code from the original mixture code to generate a mirrored pair of estimated codes. The single edge of a coding segment and the offset-overlapped adjacent segment yield, through this mirror subtraction, four groups of codes that are symmetric and adjacent to each other; after convolutional estimation, matching and fusion of the four code groups, and decoding, the complete target speech is recovered.

Finally, the mixed test speech sets were separated in experiments, and the SNR of the two estimated separated target signals was computed. Compared with existing methods, the complementary-padding separation module improves performance by no less than 0.6 percentage points, and the mirrored multi-source data fusion design improves the network by about one part in a thousand.
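The two ideas above can be illustrated with a minimal sketch: slicing a signal into segments whose neighbours share one third of their samples, and recovering the "mirror" code of the second source by subtracting the estimated target code from the mixture code under an additive mixing assumption. The function names (`segment_with_overlap`, `mirror_subtract`) are illustrative and not taken from the thesis.

```python
def segment_with_overlap(signal, seg_len, overlap_frac=1 / 3):
    """Slice a 1-D sequence into segments of length `seg_len` whose
    neighbours share `overlap_frac` of their samples.
    The hop between segment starts is seg_len * (1 - overlap_frac)."""
    hop = int(seg_len * (1 - overlap_frac))
    return [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, hop)]

def mirror_subtract(mixture_code, target_code):
    """Under a linear, additive mixing model, an estimate of the
    remaining source's code is the mixture code minus the
    estimated target code (the 'mirror' of the estimate)."""
    return [m - t for m, t in zip(mixture_code, target_code)]

# With seg_len=6 and 1/3 overlap, the hop is 4, so each segment
# shares its last 2 samples with the next segment's first 2.
segs = segment_with_overlap(list(range(12)), seg_len=6)
print(len(segs), segs[0][-2:] == segs[1][:2])  # → 2 True
```

Note that moving from 50% overlap (hop of seg_len/2) to one-third overlap (hop of 2·seg_len/3) reduces the number of segments, and hence the total input data, by a factor of 0.75, matching the 25% reduction stated above.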
Keywords/Search Tags:Speech separation, Deep neural network, End-to-end model, Temporal Convolutional Network, Time-domain, Supplement padding