| Music source separation is one of the most important research topics in music information retrieval. Its main goal is to extract one or more target sources while suppressing the other sources and noise. As a preprocessing step for many music information retrieval tasks, music source separation strongly influences the results of the subsequent tasks and therefore has important research value. Traditional music source separation methods suffer from problems such as dependence on hypotheses, limited model complexity, and a lack of representation ability. End-to-end time-domain deep learning models have been proposed to resolve these problems, but they take a long time to train, and their separation performance still needs to be improved. To further improve the representation ability and computational efficiency of end-to-end time-domain separation models, we propose an end-to-end network, Unet-SE-BiSRU, based on Demucs, the current state-of-the-art model for time-domain separation. The model proposed in this thesis improves on it in three respects. First, the bidirectional long short-term memory is replaced with a bidirectional simple recurrent unit, which reduces the number of model parameters, further improves the parallelism of training, and greatly reduces the total training time of the model. Second, an attention mechanism is introduced in the generalized encoding and decoding layers: a squeeze-and-excitation block extracts features selectively according to the type of audio to be separated, so that the waveforms of different target sources can be represented more precisely and the separation performance can be improved. Finally, group normalization is added after the one-dimensional convolution to address exploding or vanishing gradients during training and thereby accelerate the convergence of the model. Through comprehensive experiments, the optimal parameters of the model were
determined, the three improvements were verified, and the performance of our model was compared on the MUSDB18 dataset with Demucs, the current best end-to-end model, and with other typical models in this field. The experimental results show that the average signal-to-distortion ratio of the improved network increases by 0.34 dB, which is, to the best of our knowledge, the best separation performance among end-to-end time-domain methods at present, and that the training time is reduced to 2/5 of that of the original model. In addition, the model achieves the best separation performance on the drum and bass sources, and it is comparable to the best separation model in terms of average signal-to-noise ratio. Because the number of channels in the model can be further increased under the same computing power constraint, the model has great potential for further performance improvement. |
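
The squeeze-and-excitation step mentioned above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in Unet-SE-BiSRU: the function name `squeeze_excite`, the weight matrices, the bottleneck width, and the toy input are all assumptions chosen for illustration. The sketch only shows the general mechanism, in which each channel of a one-dimensional feature map is averaged over time, passed through a small two-layer gate, and then rescaled by the resulting weight.

```python
import math


def squeeze_excite(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation gate to a multi-channel 1-D feature map.

    x  : list of C channels, each a list of T floats
    w1 : R x C weight matrix of the bottleneck layer (hypothetical values)
    b1 : R biases
    w2 : C x R weight matrix of the expansion layer (hypothetical values)
    b2 : C biases
    Returns (rescaled feature map, per-channel gates).
    """
    # Squeeze: global average pooling over time, one value per channel.
    z = [sum(channel) / len(channel) for channel in x]

    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    h = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(w1, b1)]

    # Excitation, step 2: expansion layer followed by a sigmoid,
    # which yields one gate in (0, 1) per channel.
    gates = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, h)) + b)))
             for row, b in zip(w2, b2)]

    # Scale: multiply every sample of a channel by that channel's gate.
    y = [[g * v for v in channel] for g, channel in zip(gates, x)]
    return y, gates


# Toy input (made-up numbers): 2 channels, 4 time steps, bottleneck width 1.
features = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
w1 = [[1.0, -1.0]]
b1 = [0.0]
w2 = [[2.0], [-2.0]]
b2 = [0.0, 0.0]
scaled, gates = squeeze_excite(features, w1, b1, w2, b2)
```

With these made-up weights the first channel receives a gate close to 1 and the second a gate close to 0, so the first channel is passed through almost unchanged while the second is strongly attenuated. In a trained network the weights would be learned, and the gates would depend on the content of the input features.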