Research On Whisper To Normal Speech Conversion Based On Deep Neural Networks

Posted on: 2022-09-06, Degree: Master, Type: Thesis
Country: China, Candidate: Y Huang, Full Text: PDF
GTID: 2518306542463834, Subject: Software engineering
Abstract/Summary:
Whispering is a common mode of spoken communication. Because of its low loudness and weak energy, it is widely used in places where noise is prohibited. Speech produced by patients with throat damage has acoustic characteristics similar to those of whispers. Because whispers lack vocal cord vibration information, their intelligibility and naturalness are low. Nevertheless, a whisper carries complete semantic information even at minimal energy, which makes it an important interface for human-computer interaction. Converting whispers to normal speech (i.e., whisper conversion) is a useful way to recover the semantic information of whispers, and the problem has therefore attracted considerable research attention. This thesis focuses on whisper-to-normal speech conversion methods based on deep neural networks. The major contributions are as follows.

Firstly, existing methods cannot effectively exploit the local patterns in the time-frequency spectrum of speech or the long-term correlation of speech signals. They also lack an analysis of the acoustic characteristics of whispers at the fundamental frequency estimation stage. To address these problems, this thesis proposes a deep convolutional recurrent neural network (CRNN) model, which uses a convolutional neural network (CNN) to extract spectral patterns. The model also uses dilated convolutional neural networks (DCNN) to enlarge its receptive field, enabling it to model the long-term correlation of speech effectively. For fundamental frequency estimation, the fundamental frequency is decomposed by continuous wavelet transform, and the resulting prosody information is used as the training target of the fundamental frequency estimation model. Experimental results show that speech converted by the proposed CRNN-based whisper conversion method has better quality and intelligibility than speech converted by conventional methods.

Secondly, although speech converted by the CRNN model shows good quality, the method relies on Dynamic Time Warping (DTW) to align the speech data in the training set. In real application environments, aligning a large corpus is very difficult, and aligning whispered and normal utterances that differ greatly in duration degrades speech quality, which affects the performance of the model. To solve the problems caused by DTW and make the model usable in real applications, this thesis proposes a sequence-to-sequence whisper conversion method with an attention mechanism, based on the fact that an attention mechanism can learn the implicit alignment between feature sequences. Neural networks are used to capture speech features, and the attention mechanism learns the alignment between whispered speech and its parallel normal speech. Experimental results show that the proposed sequence-to-sequence whisper conversion method performs better than the baseline methods.
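To make the alignment step concrete, the following is a minimal sketch of classic DTW between a whispered and a normal feature sequence. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the DTW variant, the distance measure, or the acoustic features, so the function name `dtw_align`, the Euclidean frame distance, and the one-dimensional toy features are all hypothetical.

```python
import math


def dtw_align(src, tgt):
    """Align two frame sequences with classic DTW (hypothetical helper).

    src, tgt: lists of equal-dimension feature tuples, one per frame.
    Returns (total_cost, path), where path is a list of (src_idx, tgt_idx).
    """
    n, m = len(src), len(tgt)
    # acc[i][j] = minimal cumulative cost of aligning src[:i] with tgt[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed frame distance: Euclidean.
            d = math.dist(src[i - 1], tgt[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1],  # advance both
                                acc[i - 1][j],      # advance src only
                                acc[i][j - 1])      # advance tgt only
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path


# Toy example: the "whisper" holds one frame longer than the "normal" speech.
whisper = [(0.0,), (1.0,), (1.0,), (2.0,), (3.0,)]
normal = [(0.0,), (1.0,), (2.0,), (3.0,)]
cost, path = dtw_align(whisper, normal)
print(cost, path)
```

In the toy example, two whisper frames map to the same normal frame. The larger the duration mismatch between the two utterances, the more frames must be repeated or skipped in this way, which is the kind of hard, frame-level alignment that an attention mechanism replaces with a learned soft alignment.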
Keywords/Search Tags: Whisper conversion, Convolutional recurrent neural networks, Sequence to sequence, Attention mechanism