In real environments many noise sources are present, yet people are rarely disturbed by background noise: from a mixture of sound sources, a listener can pick out the signal of interest. For a machine to recognize or translate noisy signals, it needs audio source separation technology, that is, the ability to recover each source signal contained in a mixture while losing as little information as possible. Traditional audio separation methods rely on hand-crafted signal features and shallow separation models. The feature engineering is often complicated and its effectiveness hard to guarantee, and shallow models lack the capacity to fit the nonlinear, complex temporal and spatial structure of audio data. Moreover, such methods usually require a multi-channel mixture as input, which greatly limits their use in real-life scenarios. With the development of deep learning, data-driven training avoids hand-designed features, and as models grow deeper their nonlinear expressive power increases, making blind source separation feasible.

This paper first observes that convolutional separation models usually use a fixed-size kernel. To enlarge the receptive field, a dilation coefficient is introduced into the convolution and the dilated layers are stacked hierarchically. Although in theory the resulting receptive field can cover the global information of an audio signal, dilated convolution disrupts the signal's semantic continuity. Recurrent structures, on the other hand, can extract context-dependent temporal features, but on audio with long temporal structure they suffer from vanishing gradients and information loss. To address this, the paper proposes a network that combines convolutional and recurrent components: features containing global information are extracted by dilated convolutional layers and then integrated by a recurrent network, improving the performance of the separation network.

Second, motivated by the human ear's ability to attend to a target signal in a noisy environment through an attention mechanism, the paper also introduces attention mechanisms into the separation model. Experiments show that a temporal attention mechanism helps only in recurrent models, whereas a mixed channel-and-spatial attention further improves performance. Analysis of the loss curves during training also reveals that the small size of the data set is itself a factor limiting separation performance.

Finally, the idea of generative adversarial training is applied to both data augmentation and separation, based on the Generative Adversarial Network. For data augmentation, a generator with a C-RNN structure and a discriminator with multi-scale receptive fields are used to generate data close to the sample distribution; the generated data are validated on the separation network, confirming their usefulness. In the separation task, the generative adversarial network plays different roles depending on the training scheme: in non-end-to-end training the generator performs speech enhancement, while in end-to-end separation it acts much like a decoder. The final experimental results show that the generative adversarial approach improves the effect of audio separation.
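The receptive-field growth from hierarchically stacked dilated convolutions described above can be sketched with simple arithmetic (a causal stack with one layer per dilation is assumed; the function name is illustrative, not from the thesis):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of 1-D dilated convolutions.

    Each layer with dilation d extends the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# A TCN-style stack with kernel 3 and doubling dilations covers 31 samples:
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

The gaps of `dilation - 1` skipped samples between kernel taps are what the abstract refers to as disrupting the semantic continuity of the signal: the field grows exponentially, but it samples the input sparsely.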
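The proposed combination of dilated convolution (global feature extraction) and a recurrent layer (context integration) can be illustrated with a minimal NumPy sketch. The layer sizes, weights, and the plain tanh recurrence are simplifications of the general idea, not the thesis's actual model:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D dilated convolution: y[t] = sum_j w[j] * x[t - j*dilation]."""
    k, pad = len(w), (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def rnn_integrate(seq, w_in, w_rec):
    """Plain tanh RNN that folds a scalar feature sequence into a hidden state."""
    h = np.zeros(len(w_in))
    for x_t in seq:
        h = np.tanh(w_in * x_t + w_rec @ h)
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # toy mixture waveform
feat = x
for d in (1, 2, 4):                      # stacked dilated conv layers
    feat = dilated_conv1d(feat, np.array([0.5, 0.3, 0.2]), d)
h = rnn_integrate(feat, rng.standard_normal(8), 0.1 * rng.standard_normal((8, 8)))
print(h.shape)  # (8,)
```

The convolutional stage supplies features whose receptive field spans the input, while the recurrent pass re-reads them in order, restoring the sequential continuity that dilation skips over.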
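The mixed channel-and-spatial attention mentioned above can be sketched in a parameter-free form. Practical designs such as CBAM wrap the pooled statistics in learned layers; the pooling-only simplification here is mine, not the thesis's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_spatial_attention(x):
    """x: (channels, time) feature map; rescale channels, then time steps."""
    c_att = sigmoid(x.mean(axis=1) + x.max(axis=1))   # per-channel weight in (0, 1)
    x = x * c_att[:, None]
    s_att = sigmoid(x.mean(axis=0) + x.max(axis=0))   # per-time-step weight in (0, 1)
    return x * s_att[None, :]
```

Because both weight vectors lie in (0, 1), attention only rescales features toward the informative channels and time steps; it never amplifies them.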
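The adversarial objective behind both the data-augmentation and separation uses can be written down directly. These are the standard non-saturating GAN losses; the abstract does not state which exact formulation the thesis trains with:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator pushes D(real) -> 1 and D(fake) -> 0."""
    return -(np.log(d_real) + np.log1p(-d_fake)).mean()

def generator_loss(d_fake):
    """Non-saturating generator loss: pushes D(G(z)) -> 1."""
    return -np.log(d_fake).mean()

# The better the generator fools the discriminator, the lower its loss:
print(generator_loss(np.array([0.9])) < generator_loss(np.array([0.1])))  # True
```

In the data-augmentation setting the generator output is a synthetic audio sample; in the separation setting it is the enhanced or decoded signal, but the min-max objective is the same.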