Single-channel Speech Enhancement Method Based On Deep Neural Network

Posted on: 2022-12-31    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Li    Full Text: PDF
GTID: 2518306752993259    Subject: Automation Technology
Abstract/Summary:
Speech communication is an essential and critical part of human activity. With the advancement of technology, automatic speech processing systems such as speech recognition, emotion recognition, and speaker identification have been widely deployed, and an increasing number of human-computer interaction devices adopt speech as their main interaction method. In real life, however, speech is inevitably corrupted by various kinds of noise, which can significantly degrade the performance of these systems. It is therefore necessary to investigate speech enhancement systems that attenuate the effects of noise and interference on speech. After years of development, speech enhancement systems have achieved notable results, but shortcomings remain: (1) their performance is still not ideal; (2) most systems require large amounts of training data and long training times. As a result, the generalizability and flexibility of speech enhancement systems are still insufficient. To improve performance while reducing the requirements for training data and training time, this study proposes a new single-channel speech enhancement method based on deep neural networks, with the following main contributions.

(1) A new speech enhancement network, called the parallel convolutional recurrent network (PCRN), is constructed by adding normalized gated linear units (NGLU) and a parallel structure to a convolutional recurrent network (CRN)-based speech enhancement system. The network consists of a convolutional autoencoder, a bidirectional gated recurrent unit (BGRU), a parallel recurrent-layer structure, and a post-processing module. The convolutional autoencoder is built from a stack of NGLUs, the BGRU module further models the features, the parallel recurrent-layer structure processes both the original speech features and the encoder-processed features, and the post-processing module refines the output of the parallel structure (a minimal structural sketch is given after this abstract). This design better extracts noise-independent features from speech and reduces the network's demand for training data while improving convergence speed and performance.

(2) The PCRN network is further improved by introducing a temporal convolutional network (TCN) module and a frequency-domain adaptive attention (FAA) module, yielding the APCRN network, which addresses the unbalanced architecture and large size of PCRN. The TCN module improves the flexibility and stability of the network, while the FAA module enables the network to better learn frequency context information. This improvement further raises performance while reducing the computational cost of the network.

Experiments demonstrate that (1) compared with the CRN-based speech enhancement system, PCRN improves the three evaluation metrics PESQ, STOI, and SNR by 36.92%, 10.49%, and 5.59%, respectively, while convergence speed increases by 62.36%; (2) APCRN further improves on PCRN in PESQ, STOI, and SNR by 9.71%, 9.16%, and 8.48%, respectively; (3) the simplified APCRN-Lite, which combines a pre-trained pre-processing network with a smaller main network, still achieves better performance than PCRN and multiple baseline models.
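As a rough illustration of contribution (1), the PyTorch-style sketch below shows a gated, normalized convolution block (in the spirit of an NGLU) and a parallel recurrent structure in which two BGRU branches process the raw spectral features and the encoder output side by side. The kernel shapes, layer sizes, normalization choice, and the Linear post-processing stand-in are assumptions made for illustration; they are not the thesis's actual implementation.

```python
# Illustrative sketch only (assumed shapes and layer choices), not the PCRN code.
import torch
import torch.nn as nn


class NGLUBlock(nn.Module):
    """Conv2d gated by a parallel sigmoid branch, followed by batch normalization."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2))
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2))
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Element-wise gating: the gate branch decides which conv activations pass through.
        return self.norm(self.conv(x) * torch.sigmoid(self.gate(x)))


class ParallelRecurrent(nn.Module):
    """Two BGRU branches: one on raw features, one on encoder-processed features.
    Both inputs are assumed to share the same time dimension (B, T, dim)."""

    def __init__(self, raw_dim: int, enc_dim: int, hidden: int):
        super().__init__()
        self.raw_rnn = nn.GRU(raw_dim, hidden, batch_first=True, bidirectional=True)
        self.enc_rnn = nn.GRU(enc_dim, hidden, batch_first=True, bidirectional=True)
        self.post = nn.Linear(4 * hidden, raw_dim)  # simple stand-in for post-processing

    def forward(self, raw_feats, enc_feats):
        r, _ = self.raw_rnn(raw_feats)   # (B, T, 2*hidden)
        e, _ = self.enc_rnn(enc_feats)   # (B, T, 2*hidden)
        # Post-processing module combines both branches into the enhanced features.
        return self.post(torch.cat([r, e], dim=-1))
```

The point of the parallel branch is that the raw features retain spectral detail that the encoder may discard, so the post-processing stage can draw on both representations; how PCRN actually fuses them is specified in the thesis itself.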
Keywords/Search Tags: single-channel speech enhancement, parallel networks, multi-stage learning, low-resource training