
Monaural Singing Voice And Accompaniment Separation Research Based On U-Net Architecture

Posted on: 2021-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: H B Geng
Full Text: PDF
GTID: 2518306128475944
Subject: Master of Engineering
Abstract/Summary:
As science and multimedia technology develop, listeners expect a better music experience. As a front-end module of music processing, monaural singing voice separation lays the groundwork for applications such as fundamental frequency estimation, lyrics recognition, lyrics synchronization, singer recognition, music retrieval, and karaoke systems. Substantial progress has been made in monaural singing voice separation in recent years. The mainstream separation methods include traditional machine learning methods and deep learning methods based on neural networks. This thesis studies two deep learning methods based on the U-Net architecture: a monaural singing voice separation model based on a recurrent U-Net and time-frequency masking, and an end-to-end monaural singing voice separation model based on a gated recurrent U-Net. The details are as follows.

First, building on the advantages that the U-Net architecture, time-frequency masking, and the Discriminative Training Network offer for singing voice separation, this thesis studies in depth a monaural singing voice separation model based on a recurrent U-Net and time-frequency masking. The model takes two-dimensional spectrogram information as its input features. By restructuring the skip connections in the recurrent U-Net, it reduces the semantic gap between the spectral features of the encoder sub-network and those of the decoder sub-network, and it simplifies the optimization problem handled by the optimizer. In addition, a soft time-frequency masking function is jointly optimized through the Discriminative Training Network, which allows the model to estimate both source signals at the same time and further improves separation performance. Experimental results show that this model achieves better separation quality than the Chimera separation model and the U-Net separation model on the iKala and MIR-1K datasets.

Second, this thesis studies an end-to-end monaural singing voice separation model based on a gated recurrent U-Net. The model takes the one-dimensional time-domain waveform as input. This avoids the weakness that computing a two-dimensional time-frequency spectrogram with the short-time Fourier transform depends on many parameters, and it also avoids the information loss related to the phase. Gated Linear Units, which provide a gating mechanism, are integrated to control the input, memory, and other information and to make predictions. When the scheme for increasing the number of channels during encoding and decoding was changed from power growth to linear growth, experiments showed that the number of parameters could be greatly reduced without affecting the training results. The Difference Output Layer performs singing voice separation and accompaniment separation at the same time so that the network is jointly optimized, which lets this model estimate both source signals simultaneously. Experimental results show that, on the iKala and MIR-1K datasets, the end-to-end model based on the gated recurrent U-Net performs better than the model based on the recurrent U-Net and time-frequency masking, and its results on the iKala dataset are close to those of the Ideal Binary Mask (IBM).
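The abstract names several building blocks without giving their formulas. The following is a minimal NumPy sketch of how such blocks are commonly defined, not the thesis's actual implementation. All function names (soft_masks, glu, difference_output, channel_schedule) are hypothetical. The mask formula, the split-in-half form of the Gated Linear Unit, the "accompaniment = mixture - voice" reading of the Difference Output Layer, and the "base * (i + 1)" versus "base * 2^i" reading of linear versus power channel growth are assumptions based on common practice.

```python
import numpy as np


def soft_masks(voice_est, accomp_est, eps=1e-8):
    """Return soft time-frequency masks for voice and accompaniment.

    Assumed form: the voice mask is the voice's share of the summed
    magnitude estimates; the accompaniment mask is its complement,
    so the two masks add up to 1 in every time-frequency bin.
    """
    voice_mag = np.abs(voice_est)
    accomp_mag = np.abs(accomp_est)
    voice_mask = voice_mag / (voice_mag + accomp_mag + eps)
    return voice_mask, 1.0 - voice_mask


def glu(x, axis=0):
    """Gated Linear Unit (assumed standard form).

    Splits x into two equal halves along `axis` and multiplies the first
    half by the sigmoid of the second half, which acts as the gate.
    """
    values, gate = np.split(x, 2, axis=axis)
    return values * (1.0 / (1.0 + np.exp(-gate)))


def difference_output(mixture_wave, voice_wave):
    """Assumed difference output: accompaniment is mixture minus voice,
    so the two estimates always sum back to the input mixture."""
    return voice_wave, mixture_wave - voice_wave


def channel_schedule(depth, base, mode):
    """Channel counts per encoder level under the two assumed schemes."""
    if mode == "linear":
        return [base * (i + 1) for i in range(depth)]
    if mode == "power":
        return [base * 2 ** i for i in range(depth)]
    raise ValueError("mode must be 'linear' or 'power'")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixture_mag = rng.random((4, 6))   # toy magnitude spectrogram
    voice_raw = rng.random((4, 6))     # stand-ins for network outputs
    accomp_raw = rng.random((4, 6))

    voice_mask, accomp_mask = soft_masks(voice_raw, accomp_raw)
    voice_mag = voice_mask * mixture_mag
    accomp_mag = accomp_mask * mixture_mag
    print("masked estimates sum to mixture:",
          np.allclose(voice_mag + accomp_mag, mixture_mag))
    print("linear channels:", channel_schedule(5, 16, "linear"))
    print("power channels: ", channel_schedule(5, 16, "power"))
```

Under these assumptions, both the soft masks and the difference output guarantee that the two source estimates add up to the mixture, which is one way a single network can estimate both sources at once. The channel schedule shows why linear growth needs far fewer parameters than power growth at deeper levels.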
Keywords/Search Tags: monaural singing voice separation, Encoder-Decoder architecture, U-Net network, Gated Linear Units, End-to-end