
Research On Monaural Voice And Accompaniment Separation Using Deep Learning

Posted on: 2019-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: H M Liang
Full Text: PDF
GTID: 2348330569487832
Subject: Signal and Information Processing
Abstract/Summary:
The monaural singing voice separation problem is a type of source separation problem whose goal is to separate the vocals and the accompaniment from a mixed signal. Separation can naturally be expressed as a supervised learning problem, and with the rapid development of machine learning, methods based on supervised models have become the research trend in recent years. Deep neural networks, such as convolutional autoencoders, have significantly improved performance on the monaural singing voice separation problem.

The input to the neural network is usually the magnitude spectrogram, or features extracted from it. There are two choices for the output: the magnitude of the vocal spectrogram, or a time-frequency mask. Because the spectrogram has a wider dynamic range, previous methods tend to predict the time-frequency mask. When estimating the vocal spectrogram directly, the model must perform complex nonlinear operations to remove the frequency components of the accompaniment while outputting the frequency components of the voice with little distortion. Increasing the number of convolutional and pooling layers enhances the network's nonlinear processing capability, but also introduces more distortion. In response to this problem, we propose using U-Net to handle the separation. U-Net adds merge layers and cross-layer (skip) connections to the convolutional autoencoder. A cross-layer connection links two non-adjacent layers, so the output can acquire high-resolution features that have not been pooled. In addition, to avoid overfitting, we propose a data augmentation method for the separation problem.

We design a series of experiments to demonstrate the characteristics of the U-Net method. Experiments on the iKala dataset show that the separation performance of U-Net is consistently better than that of an autoencoder of the same depth, and that increasing the depth of U-Net improves separation performance. Moreover, predicting the magnitude separates better than predicting the mask, and when estimating the magnitude, choosing KL divergence as the cost function achieves better performance than mean squared error.

We also conducted an evaluation on the DSD100 dataset, where the U-Net-based method achieves third place without any additional processing of the separated voice. Compared to other state-of-the-art approaches, U-Net has the advantages of a simpler separation framework, lower latency, faster speed, and fewer weights.

Finally, we propose, for the first time, visualizing the separation network in the form of a video. The video reflects how the hidden-layer outputs change with different audio inputs, and we find that U-Net extracts distinctive audio features.
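The distinction between predicting a time-frequency mask and predicting the magnitude directly can be illustrated with a minimal numpy sketch. All arrays and shapes below are hypothetical stand-ins for real STFT magnitudes; the soft mask shown is one common formulation, not necessarily the exact one used in the thesis.

```python
import numpy as np

# Hypothetical magnitude spectrograms (freq bins x time frames);
# in practice these would come from the STFT of real audio.
rng = np.random.default_rng(0)
vocal_true = rng.random((513, 100))     # vocal magnitude
accomp_true = rng.random((513, 100))    # accompaniment magnitude
mixture = vocal_true + accomp_true      # observed mixture magnitude

# Soft time-frequency mask: each bin holds the fraction of the
# mixture energy attributed to the vocals, so it lies in [0, 1].
mask = vocal_true / (vocal_true + accomp_true)

# Mask-based separation: applying the mask (and its complement) to the
# mixture yields estimates that sum back to the mixture exactly,
# unlike two independently predicted magnitude spectrograms.
vocal_sep = mask * mixture
accomp_sep = (1.0 - mask) * mixture
```

The narrower range of the mask ([0, 1] versus the wide dynamic range of magnitudes) is the reason the abstract cites for earlier methods preferring mask prediction.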
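The cross-layer connection that U-Net adds to the convolutional autoencoder can be sketched with plain numpy, using pooling/upsampling in place of the real convolutional layers. The shapes and the 2x2 pooling here are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

# Toy encoder feature map: (batch, channels, freq, time).
x = np.random.default_rng(1).random((1, 16, 64, 32))

# "Encoder": 2x2 max pooling halves the time-frequency resolution.
pooled = x.reshape(1, 16, 32, 2, 16, 2).max(axis=(3, 5))

# "Decoder": nearest-neighbour upsampling restores the shape,
# but the detail discarded by pooling is gone.
upsampled = pooled.repeat(2, axis=2).repeat(2, axis=3)

# U-Net's merge layer: concatenate the un-pooled encoder features
# onto the decoder path, so the output also sees high-resolution
# features that never passed through a pooling layer.
merged = np.concatenate([upsampled, x], axis=1)
```

A plain autoencoder decoder would see only `upsampled`; the concatenation is what lets deeper networks gain nonlinear capacity without accumulating the distortion the abstract attributes to pooling.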
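The two cost functions compared for magnitude prediction can be written out as a short sketch. The generalized KL divergence below is one standard nonnegative-spectrogram variant; the abstract does not spell out the exact form used, so treat it as an assumption.

```python
import numpy as np

def kl_divergence(target, pred, eps=1e-8):
    """Generalized KL divergence for nonnegative spectrograms."""
    t, p = target + eps, pred + eps
    return float(np.sum(t * np.log(t / p) - t + p))

def mse(target, pred):
    """Mean squared error between two spectrograms."""
    return float(np.mean((target - pred) ** 2))

rng = np.random.default_rng(0)
target = rng.random((513, 100))
pred = 0.9 * target  # a slightly under-estimated magnitude

perfect_kl = kl_divergence(target, target)  # zero at a perfect estimate
kl_cost = kl_divergence(target, pred)       # positive otherwise
mse_cost = mse(target, pred)
```

Both costs are zero only at a perfect estimate; they differ in how they weight errors, and the iKala experiments in the abstract found the KL cost the better choice for magnitude targets.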
Keywords/Search Tags:deep learning, neural network, monaural audio source separation, autoencoder, U-Net