
Research On Monaural Voice And Accompaniment Separation Using Deep Learning

Posted on: 2019-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: H M Liang
Full Text: PDF
GTID: 2348330569487832
Subject: Signal and Information Processing
Abstract/Summary:
The monaural singing voice separation problem is a type of source separation problem whose goal is to separate the vocals and the accompaniment from a mixed signal. Separation can naturally be expressed as a supervised learning problem, and with the rapid development of machine learning, methods based on supervised models have become the research trend in recent years. Deep neural networks, such as convolutional autoencoders, have significantly improved performance on the monaural singing voice separation problem.

The input to the neural network is usually the magnitude spectrogram, or features extracted from it. There are two choices for the output: the magnitude of the vocal spectrogram, or a time-frequency mask. Because the spectrogram has a wider dynamic range, previous methods tend to predict the time-frequency mask. When estimating the vocal spectrogram directly, the model must perform complex nonlinear operations to remove the frequency components of the accompaniment while outputting the frequency components of the voice with little distortion. Increasing the number of convolutional and pooling layers enhances the network's nonlinear processing capability, but also introduces more distortion. In response to this problem, we propose using U-Net to handle the separation. U-Net adds merge layers and cross-layer (skip) connections to the convolutional autoencoder. A cross-layer connection links two non-adjacent layers, so the output can acquire high-resolution features that have not been pooled. In addition, to avoid overfitting, we propose a data augmentation method for the separation problem.

We design a series of experiments to demonstrate the characteristics of the U-Net method. Experiments on the iKala dataset show that the separation performance of U-Net is consistently better than that of an autoencoder of the same depth, and that increasing the depth of U-Net improves separation performance. Moreover, predicting the magnitude separates better than predicting the mask, and when estimating the magnitude, choosing KL divergence as the cost function achieves better performance than mean squared error.

We also conducted an evaluation on the DSD100 dataset, where the U-Net-based method achieves third place without any additional processing of the separated voice. Compared to other state-of-the-art approaches, U-Net has the advantages of a simpler separation framework, lower latency, faster speed, and fewer weights.

Finally, we propose, for the first time, visualizing the separation network in the form of a video. The video reflects how the hidden-layer outputs change with different audio inputs, and we find that U-Net extracts distinctive audio features.
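The distinction between predicting a time-frequency mask and predicting the magnitude directly can be illustrated with a minimal numpy sketch. All arrays and shapes below are hypothetical stand-ins for real STFT magnitudes; the soft mask shown is one common formulation, not necessarily the exact one used in the thesis.

```python
import numpy as np

# Hypothetical magnitude spectrograms (freq bins x time frames);
# in practice these would come from the STFT of real audio.
rng = np.random.default_rng(0)
vocal_true = rng.random((513, 100))     # vocal magnitude
accomp_true = rng.random((513, 100))    # accompaniment magnitude
mixture = vocal_true + accomp_true      # observed mixture magnitude

# Soft time-frequency mask: each bin holds the fraction of the
# mixture energy attributed to the vocals, so it lies in [0, 1].
mask = vocal_true / (vocal_true + accomp_true)

# Mask-based separation: applying the mask (and its complement) to the
# mixture yields estimates that sum back to the mixture exactly,
# unlike two independently predicted magnitude spectrograms.
vocal_sep = mask * mixture
accomp_sep = (1.0 - mask) * mixture
```

The narrower range of the mask ([0, 1] versus the wide dynamic range of magnitudes) is the reason the abstract cites for earlier methods preferring mask prediction.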
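The cross-layer connection that U-Net adds to the convolutional autoencoder can be sketched with plain numpy, using pooling/upsampling in place of the real convolutional layers. The shapes and the 2x2 pooling here are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

# Toy encoder feature map: (batch, channels, freq, time).
x = np.random.default_rng(1).random((1, 16, 64, 32))

# "Encoder": 2x2 max pooling halves the time-frequency resolution.
pooled = x.reshape(1, 16, 32, 2, 16, 2).max(axis=(3, 5))

# "Decoder": nearest-neighbour upsampling restores the shape,
# but the detail discarded by pooling is gone.
upsampled = pooled.repeat(2, axis=2).repeat(2, axis=3)

# U-Net's merge layer: concatenate the un-pooled encoder features
# onto the decoder path, so the output also sees high-resolution
# features that never passed through a pooling layer.
merged = np.concatenate([upsampled, x], axis=1)
```

A plain autoencoder decoder would see only `upsampled`; the concatenation is what lets deeper networks gain nonlinear capacity without accumulating the distortion the abstract attributes to pooling.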
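The two cost functions compared for magnitude prediction can be written out as a short sketch. The generalized KL divergence below is one standard nonnegative-spectrogram variant; the abstract does not spell out the exact form used, so treat it as an assumption.

```python
import numpy as np

def kl_divergence(target, pred, eps=1e-8):
    """Generalized KL divergence for nonnegative spectrograms."""
    t, p = target + eps, pred + eps
    return float(np.sum(t * np.log(t / p) - t + p))

def mse(target, pred):
    """Mean squared error between two spectrograms."""
    return float(np.mean((target - pred) ** 2))

rng = np.random.default_rng(0)
target = rng.random((513, 100))
pred = 0.9 * target  # a slightly under-estimated magnitude

perfect_kl = kl_divergence(target, target)  # zero at a perfect estimate
kl_cost = kl_divergence(target, pred)       # positive otherwise
mse_cost = mse(target, pred)
```

Both costs are zero only at a perfect estimate; they differ in how they weight errors, and the iKala experiments in the abstract found the KL cost the better choice for magnitude targets.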
Keywords/Search Tags:deep learning, neural network, monaural audio source separation, autoencoder, U-Net