
Research On Sequence-to-sequence Acoustic Modeling For Speech Generation

Posted on: 2022-09-09    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J X Zhang    Full Text: PDF
GTID: 1488306323464204    Subject: Signal and Information Processing
Abstract/Summary:
A sequence-to-sequence model is a deep-learning-based statistical model for the conditional probability of an output sequence given an input sequence. In recent years, sequence-to-sequence models have performed well in many research fields, including speech recognition and natural language processing. Since the Tacotron model was proposed by Google researchers in 2017, sequence-to-sequence models have been widely applied to speech generation tasks and have achieved impressive results. The main advantage of the sequence-to-sequence model over conventional models lies in its flexible framework, which can in theory model the relationship between any pair of sequences. By modeling the probability of a sequence in an autoregressive manner, it also discards the unreasonable time-independence assumption on the conditional probability made by the traditional hidden Markov model (HMM). The sequence-to-sequence framework does not prescribe every detail of the model: depending on the properties of the input data stream, different tasks can use various neural networks, such as long short-term memory networks or convolutional neural networks, to construct the different parts of a sequence-to-sequence model.

Speech generation enables a machine to produce speech flexibly and is an important part of human-computer interaction; it therefore has many applications and significant research value. The speech generation tasks on which this thesis focuses include text-to-speech synthesis, voice conversion, and articulation-to-speech generation. Although these tasks differ in their input data, they share the same objective of producing realistic speech, and sequence-to-sequence models can be applied to all of them. In recent years, the application of sequence-to-sequence modeling to speech generation has made significant progress. However, some problems remain unsolved. The attention mechanism in sequence-to-sequence models is
unstable and may cause mispronunciation in the generated speech; sequence-to-sequence models have not yet been applied successfully to parallel or non-parallel voice conversion; articulation-to-speech generation faces a data-sparsity challenge; and so on. This thesis addresses these problems, conducts research on sequence-to-sequence acoustic models for speech generation, and improves model performance on the corresponding speech generation tasks. The main research content of this thesis is as follows.

First, this thesis studies the attention mechanism in text-to-speech synthesis. Sequence-to-sequence text-to-speech models suffer from an instability issue: mispronunciation and repeated pronunciation often occur. Inspired by the monotonic nature of the alignment between text and speech, this thesis proposes a forward attention method for sequence-to-sequence acoustic modeling in speech synthesis. Experiments show that the proposed method effectively improves the stability of sequence-to-sequence speech synthesis.

Second, this thesis studies sequence-to-sequence methods for voice conversion. For parallel voice conversion, this thesis presents a sequence-to-sequence voice conversion model that can adjust the prosody of the speech, such as the speaking rate, so that the naturalness and similarity of the converted speech are effectively improved. For non-parallel voice conversion, this thesis proposes a sequence-to-sequence model based on feature disentanglement. The proposed model adopts an adversarial learning strategy and a strategy of learning a joint space with text input, which effectively separates the speaker and linguistic information in speech signals. It can be applied to non-parallel voice conversion, with naturalness and similarity close to those of the parallel sequence-to-sequence method. In addition, this thesis presents a recognition-synthesis voice conversion method with adversarial learning and a voice conversion method that cascades an automatic speech
recognition and text-to-speech system directly.

Third, this thesis studies the articulation-to-speech conversion task, adopting ultrasound tongue images and optical lip images as articulatory features, with the objective of reconstructing natural speech. To handle the data-sparsity issue of articulatory features, this thesis presents a transfer-learning strategy from text-to-speech synthesis, which effectively improves the intelligibility and naturalness of the reconstructed speech.
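As a minimal illustration (not code from the thesis), the autoregressive chain-rule factorization that distinguishes sequence-to-sequence models from HMM-style time-independent models can be sketched as follows; the toy `uniform` model and the vocabulary size are hypothetical placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sequence_log_prob(logits_fn, x, y):
    """log P(y | x) = sum_t log P(y_t | y_<t, x): each output token is
    conditioned on the FULL output history, unlike an HMM observation
    model that assumes conditional independence across time."""
    total = 0.0
    for t, token in enumerate(y):
        probs = softmax(logits_fn(x, y[:t]))  # conditions on y_<t
        total += np.log(probs[token])
    return total

# Toy "model": a fixed uniform distribution over a 4-token vocabulary.
uniform = lambda x, history: np.zeros(4)
lp = sequence_log_prob(uniform, x=None, y=[0, 1, 2])
# With uniform probabilities, log P = 3 * log(1/4).
```

In a real sequence-to-sequence acoustic model, `logits_fn` would be a neural decoder attending over encoded input features rather than a fixed distribution.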
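The forward attention method summarized above exploits the monotonic text-to-speech alignment by only allowing attention to stay at the current input position or advance by one. A minimal sketch of the core recursion, assuming a per-step normalized content-based attention distribution `y_t` over the encoder states (a simplification that omits the rest of the model):

```python
import numpy as np

def forward_attention_step(y_t, alpha_prev):
    """One step of the forward attention recursion:
    alpha_t(n) is proportional to (alpha_{t-1}(n) + alpha_{t-1}(n-1)) * y_t(n),
    so attention mass at position n can only come from n itself or n-1,
    enforcing a monotonic alignment."""
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))  # alpha_{t-1}(n-1)
    alpha = (alpha_prev + shifted) * y_t
    return alpha / alpha.sum()  # renormalize

alpha = np.array([1.0, 0.0, 0.0, 0.0])  # alignment starts at the first input
y_t = np.full(4, 0.25)                  # content-based attention weights
alpha = forward_attention_step(y_t, alpha)
# After one step, mass can only sit on positions 0 and 1: [0.5, 0.5, 0, 0]
```

This monotonicity constraint is what suppresses the skipped and repeated pronunciations that unconstrained content-based attention can produce.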
Keywords/Search Tags:sequence-to-sequence model, speech generation, text-to-speech, voice conversion, articulation-to-speech, acoustic model