Monaural Singing Voice Separation Using Deep Learning

Posted on: 2021-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2518306113451624
Subject: Computer Science and Technology
Abstract/Summary:
Monaural singing voice separation is the task of separating the accompaniment and the singing voice from a monophonic song. It provides a reference and research basis for monophonic multi-source separation, and it can be applied to music information retrieval. The task is challenging because the singing voice and the accompaniment are intertwined in the spectrum, and much research has focused on finding ways to distinguish them clearly. Traditional non-deep-learning algorithms distinguish the two sources using prior knowledge of audio processing; they lack flexibility, and it is difficult to find a single algorithm that suits all types of songs. With the development of deep learning, neural networks learn distinguishing features automatically and fit the relationship between input and output, achieving better separation quality than non-deep-learning methods.

Because deep learning performs remarkably well on images and more information is available in the frequency domain, most existing methods model the problem in the frequency domain. Frequency-domain separation algorithms follow three steps. First, the spectrogram of the song is taken as input. Second, a suitable neural network is built to predict the spectrograms of the singing voice and the accompaniment. Finally, the predicted spectrograms are combined with the phase of the song to reconstruct the accompaniment and the singing voice in the time domain. At present, frequency-domain separation algorithms have reached a bottleneck, for two main reasons: (1) The neural networks used for singing voice separation have a serial structure, so down-sampling discards some feature information, which restricts the accuracy of the predicted spectrograms. (2) Because the human ear is not sensitive to the phase of signals, the phase of the song is used to approximate the phases of the accompaniment and the singing voice in the reconstruction stage, and phase modeling is neglected. Although phase has a relatively small effect on separation performance, the mismatched and inaccurate phase ultimately limits it.

Based on the current state of research on monaural singing voice separation, this thesis focuses on the neural network structure and the accuracy of the phase, and proposes a set of monaural singing voice separation algorithms to improve the quality of the separated signals. The main contents and innovations of this thesis are as follows:

(1) This thesis proposes a complete algorithm for monaural singing voice separation based on the High-Resolution neural network (HR-Net). Because the high-resolution network structure is parallel, high-resolution feature maps are maintained throughout alongside feature maps at other resolutions, which eliminates the information loss caused by down-sampling in serial networks. The feature maps are then fused repeatedly to generate new semantics, so that the network learns comprehensive, high-precision, highly abstract features. HR-Net is used to model the problem in the frequency domain, so that the magnitudes of the predicted spectrograms retain high precision in detail. Experiments on the MIR-1K dataset show that the SDR, SIR and SAR of the proposed algorithm are better than those of the state-of-the-art algorithm, confirming the effectiveness of the proposed algorithm.

(2) This thesis proposes a phase-estimation strategy. To address the inaccurate phase of the frequency-domain model, the strategy takes advantage of a time-domain model. It estimates the phase spectra of the separated signals from the magnitudes of the separated spectrograms, the phase of the original song, and the phases of the signals separated by the time-domain model, which effectively alleviates the phase inaccuracy. With phase estimation added, experiments on the MIR-1K dataset show that SDR and SIR improve significantly, indicating that an accurate phase reduces interference from other signals and yields better target signals.

To keep the magnitudes of the predicted accompaniment and singing voice spectrograms consistent with the magnitude of the original song spectrogram, this thesis adopts a soft time-frequency mask to constrain the magnitudes of the predicted spectrograms. The mask constrains the sum of the accompaniment and singing voice magnitudes to equal the magnitude of the original song spectrogram, making the predicted results more accurate and closer to the real target signals. Experiments show that the magnitude constraint improves the performance of the algorithm.
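
The magnitude constraint and the reconstruction step described above can be illustrated with a short sketch. It assumes the network outputs two non-negative magnitude estimates. The ratio-style mask, the names `soft_masks`, `separate` and `EPS`, and the random toy data are all illustrative assumptions, not the implementation used in the thesis. The sketch reuses the phase of the original song for both sources, which is the baseline approximation rather than the proposed phase-estimation strategy.

```python
import numpy as np

EPS = 1e-8  # hypothetical constant; guards against division by zero in silent bins


def soft_masks(voice_mag, accomp_mag):
    """Turn two raw magnitude estimates into soft masks that sum to 1 per bin."""
    total = voice_mag + accomp_mag + EPS
    voice_mask = voice_mag / total
    return voice_mask, 1.0 - voice_mask


def separate(mix_spec, voice_mag, accomp_mag):
    """Mask the mixture magnitude and reuse the mixture phase for both sources."""
    mix_mag = np.abs(mix_spec)
    mix_phase = np.angle(mix_spec)
    voice_mask, accomp_mask = soft_masks(voice_mag, accomp_mag)
    voice_spec = voice_mask * mix_mag * np.exp(1j * mix_phase)
    accomp_spec = accomp_mask * mix_mag * np.exp(1j * mix_phase)
    return voice_spec, accomp_spec


# Toy data standing in for a song spectrogram and for network outputs (assumed values).
rng = np.random.default_rng(0)
mix_spec = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))
voice_mag = rng.uniform(0.0, 1.0, size=(5, 4))
accomp_mag = rng.uniform(0.0, 1.0, size=(5, 4))

voice_spec, accomp_spec = separate(mix_spec, voice_mag, accomp_mag)
```

Because the two masks sum to one in every time-frequency bin, the separated magnitudes add up to the magnitude of the song spectrogram. An inverse short-time Fourier transform of `voice_spec` and `accomp_spec` would then give the time-domain signals.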
Keywords/Search Tags: Monaural singing voice separation, High-Resolution neural network (HR-Net), Spectrogram, Phase estimate, Deep learning