
Research On Whisper To Normal Speech Conversion Based On Convolutional Neural Network

Posted on: 2021-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: H L Lian
Full Text: PDF
GTID: 2428330629480389
Subject: Computer Science and Technology
Abstract/Summary:
Whisper refers to low-energy pronunciation without vocal cord vibration. It is a special and essential style of communication between people. For example, in places such as libraries and conference rooms where speaking loudly is prohibited, people often whisper for human-to-human communication or human-computer interaction. In recent years, whisper has also become one of the most convenient human-computer interfaces, compared with interfaces based on surface electromyography or magnetic resonance imaging. Whisper therefore has broad application prospects, and the study of whisper to normal speech conversion (usually called whisper-to-speech conversion) has attracted much attention from researchers. This thesis focuses on whisper-to-speech conversion technology based on convolutional neural networks. The major work is divided into the following two parts.

First, investigation shows that existing whisper-to-speech conversion methods cannot make full use of the time-domain and frequency-domain correlation of speech for modeling. When the spectra of adjacent consecutive speech frames are spliced into a matrix, the local correlation along the time and frequency dimensions is very similar to the correlation between adjacent pixels in an image. Each neuron in a convolutional layer of a Convolutional Neural Network (CNN) is computed by convolving over multiple neurons in an adjacent region of the previous layer. Because the points in such a region of the previous layer carry the time-domain and frequency-domain information of the input speech spectrum, the convolutional layer can extract the time-frequency correlation implicit in the spectral features. To make full use of this correlation for modeling, this thesis proposes using a deep convolutional neural network (DCNN) model for whisper-to-speech conversion. Experimental results show that the converted speech obtained with the proposed DCNN model is closer to normal speech than that obtained with the DNN model.

Second, although the DCNN model can make full use of the time-domain and frequency-domain correlation of speech when modeling whisper-to-speech conversion, it uses only a fully connected layer to fit the mapping between the features extracted by the convolutional layers and the normal speech features. Because the fully connected layer treats each frame of input speech features as independent, the DCNN cannot further exploit temporal correlation when modeling the features extracted by the convolutional layers. BLSTM (Bidirectional Long Short-Term Memory) makes good use of temporal correlation, so to combine the advantages of CNN and BLSTM, this thesis proposes using a Deep Convolutional Recurrent Neural Network (DCRNN) for whisper-to-speech conversion. The method has been verified on a real whisper database, and the experimental results show that its conversion performance is further improved compared with the DCNN model.
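The frame-splicing idea in the first part of the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `splice_frames` and `conv2d_valid`, the toy spectrum values, the 2x2 averaging kernel, and the edge handling (repeating the first or last frame) are all assumptions made for illustration. It only shows how adjacent frames stacked into a time-by-frequency matrix let a single convolution output depend on neighbouring frames and neighbouring frequency bins at once.

```python
from typing import List

Matrix = List[List[float]]


def splice_frames(frames: Matrix, center: int, context: int) -> Matrix:
    """Stack frames center-context .. center+context into a (time x frequency) patch.

    Frames past the utterance boundary are replaced by the nearest edge frame.
    This edge handling is an assumption; the abstract does not specify it.
    """
    last = len(frames) - 1
    return [list(frames[min(max(t, 0), last)])
            for t in range(center - context, center + context + 1)]


def conv2d_valid(patch: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2-D cross-correlation, the basic operation of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(patch) - kh + 1
    out_w = len(patch[0]) - kw + 1
    return [[sum(patch[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


# Toy "spectrum": 4 frames x 4 frequency bins (made-up values).
frames = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [4.0, 5.0, 6.0, 7.0],
]

patch = splice_frames(frames, center=1, context=1)  # frames 0, 1 and 2
kernel = [[0.25, 0.25], [0.25, 0.25]]               # hypothetical 2x2 averaging kernel
feature_map = conv2d_valid(patch, kernel)

for row in feature_map:
    print(row)
```

Each entry of `feature_map` mixes two consecutive frames and two adjacent frequency bins, which is the local time-frequency correlation the abstract refers to. A per-frame fully connected layer placed on top of such features would still treat every spliced patch independently, which is the limitation the second part of the abstract addresses with BLSTM.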
Keywords/Search Tags: Whisper-to-speech conversion, Correlation between time and frequency domain, DCNN, DCRNN