Research On Whisper To Normal Speech Conversion Based On Deep Neural Networks

Posted on:2022-09-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y Huang

Full Text:PDF

GTID:2518306542463834

Subject:Software engineering

Abstract/Summary:

Whispering is a common mode of spoken communication. Because of its low loudness and weak energy, it is widely used in places where noise is prohibited. Speech produced by patients with throat damage has acoustic characteristics similar to those of whispers. Because whispers lack vocal cord vibration information, their intelligibility and naturalness are low. However, whispered speech carries complete semantic information even at minimal energy, which makes it an essential interface for human-computer interaction. Converting whispers to normal speech (i.e., whisper conversion) is a useful way to recover the semantic information of whispers, and the problem has therefore attracted considerable attention from researchers. This thesis focuses on whisper-to-normal speech conversion methods based on deep neural networks. The major contributions are as follows.

First, existing methods cannot effectively exploit the local patterns in the time-frequency spectrum of speech or the long-term correlation of speech signals. Existing methods also do not analyze the acoustic characteristics of whispers when estimating the fundamental frequency of speech. To address these problems, this thesis proposes a deep convolutional recurrent neural network (CRNN) model, which uses a convolutional neural network (CNN) to extract spectral patterns. The model also uses a dilated convolutional neural network (DCNN) to enlarge its receptive field, enabling it to model the long-term correlation of speech effectively. For fundamental frequency estimation, the fundamental frequency is decomposed by the continuous wavelet transform, and the resulting prosody information is used as the training target of the fundamental frequency estimation model. Experimental results show that speech converted by the proposed CRNN-based method has better quality and intelligibility than speech converted by conventional methods.

Second, although speech converted by the CRNN model shows good quality, the method relies on dynamic time warping (DTW) to align the speech data in the training set. In practical applications, aligning a large corpus is very difficult, and aligning whispered and normal utterances that differ greatly in duration degrades speech quality, which harms model performance. To overcome the problems caused by DTW and make the model usable in real applications, this thesis proposes a sequence-to-sequence whisper conversion method with an attention mechanism, motivated by the fact that an attention mechanism can learn the implicit alignment between feature sequences. Neural networks are used to capture speech features, and the attention mechanism learns the alignment between whispered speech and its parallel normal speech. Experimental results show that the proposed sequence-to-sequence whisper conversion method outperforms the baseline methods.
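To make the DTW alignment step mentioned in the abstract more concrete, the following is a minimal sketch of classic dynamic time warping between two feature sequences of different lengths. It is an illustration only, not the thesis's implementation: the function name `dtw_align`, the Euclidean frame distance, and the toy one-dimensional "features" are all assumptions made for this example.

```python
import math


def dtw_align(source, target):
    """Align two sequences of feature vectors with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    mapping source frame i to target frame j. Illustrative sketch only.
    """
    n, m = len(source), len(target)

    def dist(a, b):
        # Euclidean distance between two frames (an assumed choice).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # cost[i][j]: minimal accumulated distance aligning source[:i] with target[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat target frame
                                 cost[i][j - 1],      # repeat source frame
                                 cost[i - 1][j - 1])  # advance both

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy example: a 3-frame "whisper" aligned to a 4-frame "normal" utterance.
whisper = [[0.0], [1.0], [2.0]]
normal = [[0.0], [1.0], [1.0], [2.0]]
total_cost, path = dtw_align(whisper, normal)
print(total_cost, path)
```

In this toy case the second whisper frame is paired with two normal frames, which shows how DTW stretches one sequence to match the other. Every frame pair must be computed and stored, and large duration differences force many such repeated frames, which is consistent with the abstract's motivation for replacing explicit DTW alignment with an alignment learned by attention.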

Keywords/Search Tags:

Whisper conversion, Convolutional recurrent neural networks, Sequence to sequence, Attention mechanism

Related items

1	Research On Speech Synthesis Algorithm Based On Sequence To Sequence Model
2	Research On Sequence Recommendation Method Based On Hybrid Neural Network
3	Research On Answer Generation Based On Sequence-to-sequence Model
4	The Design And Implementation Of An Automatic Image Captioning System Based On Deep Neural Networks
5	Research On Deep Learning Algorithm For Sequence Data
6	Research On Whisper To Normal Speech Conversion Based On Convolutional Neural Network
7	Research On Neural Sequence Prediction Model
8	Convolutional Sequence-to-sequence Based Neural Networks For Lip Reading
9	Research On Dynamic Emotion Recognition Based On Spatial-Temporal Neural Networks
10	Research On Image Description Method Based On Multimodal Recurrent Neural Networks