
Research And Implementation Of Chinese Text-driven Talking-face Generation Method

Posted on: 2023-09-23 | Degree: Master | Type: Thesis
Country: China | Candidate: W Sun | Full Text: PDF
GTID: 2568307058999529 | Subject: Computer technology
Abstract/Summary:
Text-driven talking-face video generation aims to generate, from given text, a talking-face video with the corresponding speech. The technology fuses speech synthesis with talking-face video generation: speech is first synthesized from text, and speech features then drive the generation of the talking-face video, realizing the conversion from text to speech and video. The automatic generation of speech and video is widely used in film production, video games, virtual anchors, and similar applications.

The multi-modal conversion from text to speech and video is hampered by modal heterogeneity and data noise, which make the mapping relationships among modalities difficult to learn. Current talking-face video generation methods are mostly audio-driven, which limits their flexibility and controllability, and they impose few constraints on the correlation between lip movements and speech, resulting in low lip synchronization in the generated videos. In addition, most speech synthesis research focuses on English scenarios; given the large differences between Chinese and English, speech synthesis for Chinese scenarios still leaves much room for exploration. To address these problems, this thesis studies Chinese text-driven talking-face video generation and builds a text-to-speech-and-video generation system. The main work and contributions of this thesis are summarized as follows:

(1) This thesis constructs a Chinese multimodal news dataset, CCTV-NEWS, which contains data from three modalities: text, speech and video. We collected nearly three years of video data from CCTV's "Xin Wen Lian Bo" and developed semi-automated data processing tools to perform scene segmentation, screening and text extraction on the source videos, which shortens the production cycle of the dataset (a scene-segmentation sketch is given after the abstract).

(2) For speech synthesis, this thesis designs a non-autoregressive Chinese speech synthesis network based on the Transformer. The network abandons the attention-based alignment mechanism of traditional autoregressive methods and instead uses Gaussian upsampling for alignment, enabling parallel synthesis of the speech Mel-spectrogram (sketched after the abstract). The network also adopts prosodic filtering to model the prosodic information of phonemes. Comparative experiments on the self-built dataset and the public Baker dataset verify the effectiveness of the speech synthesis network designed in this thesis.

(3) For talking-face video generation, this thesis first studies a generation method that uses facial landmarks as intermediate features and optimizes its loss function. Comparative experiments on the self-built CCTV-NEWS dataset and the public GRID dataset verify the effectiveness of the optimized loss function. The thesis then further improves the generation method by designing an end-to-end talking-face video generation network based on a GAN model and adding a pre-trained lip synchronization discriminator (LSD) as an additional lip-sync loss (sketched after the abstract). To improve performance in the Chinese scenario, the model is fine-tuned on the self-built Chinese dataset. Extensive experiments are carried out on the self-built CCTV-NEWS dataset and the public LRS2 dataset, and the quantitative and qualitative results demonstrate the effectiveness of the method. The model can take the voice of any target character as input and generate a lip-synced talking-face video.
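The following is a minimal sketch of the kind of content-based scene segmentation used when preparing a dataset such as CCTV-NEWS, assuming the open-source PySceneDetect library; the thesis's own semi-automated tools are not publicly described, so this code is illustrative only.

# Illustrative scene segmentation with PySceneDetect (an assumption; not the
# thesis's actual tooling). Each detected scene can then be cut out with ffmpeg
# and passed to the screening and text-extraction steps described above.
from scenedetect import detect, ContentDetector

def find_scenes(video_path, threshold=27.0):
    """Return a list of (start, end) timecodes for content-based scene cuts."""
    return detect(video_path, ContentDetector(threshold=threshold))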
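Below is a minimal PyTorch sketch of Gaussian upsampling as an alignment mechanism for non-autoregressive TTS: phoneme-level encoder outputs are expanded to frame level using predicted durations and Gaussian weights instead of attention. Tensor names, shapes, and the softmax normalization are assumptions for illustration, not the thesis's exact implementation.

# Gaussian upsampling sketch (assumed shapes; hedged, not the thesis's code).
import torch

def gaussian_upsample(h, durations, sigma):
    """Expand phoneme encodings to frame level with Gaussian weights.

    h:         (B, N, D) phoneme-level encoder outputs
    durations: (B, N)    predicted durations in frames (float)
    sigma:     (B, N)    predicted per-phoneme standard deviations (> 0)
    returns:   (B, T, D) frame-level features for the Mel-spectrogram decoder
    """
    # Centre of each phoneme on the frame axis: end of its span minus half its length.
    ends = torch.cumsum(durations, dim=1)                 # (B, N)
    centres = ends - 0.5 * durations                      # (B, N)

    T = int(durations.sum(dim=1).max().round().item())
    t = torch.arange(T, device=h.device).float() + 0.5    # frame positions (T,)

    # Gaussian weight of every output frame against every phoneme centre,
    # normalised over phonemes so each frame is a convex combination.
    dist = t.view(1, T, 1) - centres.unsqueeze(1)         # (B, T, N)
    logits = -0.5 * (dist / sigma.unsqueeze(1)) ** 2      # (B, T, N)
    w = torch.softmax(logits, dim=2)

    return torch.bmm(w, h)                                # (B, T, D)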
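Finally, a minimal sketch of how a frozen, pre-trained lip synchronization discriminator can serve as an additional generator loss: the discriminator embeds a window of generated mouth frames and the aligned Mel-spectrogram segment, and the generator is penalised when the two embeddings disagree. The module names (face_enc, audio_enc), shapes, and the loss weighting are hypothetical, not the thesis's exact formulation.

# Lip-sync loss sketch (hypothetical module names; hedged).
import torch
import torch.nn.functional as F

def lip_sync_loss(lsd, fake_frames, mel_window):
    """Penalise generated mouth regions that do not match the driving audio.

    lsd:         frozen, pre-trained lip-sync discriminator with face/audio encoders
    fake_frames: (B, C, Tv, H, W) generated face crops (lower half)
    mel_window:  (B, 1, n_mels, Ta) Mel-spectrogram segment aligned to the frames
    """
    v = F.normalize(lsd.face_enc(fake_frames), dim=1)   # (B, E) visual embedding
    a = F.normalize(lsd.audio_enc(mel_window), dim=1)    # (B, E) audio embedding
    sim = (v * a).sum(dim=1).clamp(1e-7, 1.0)             # cosine similarity in (0, 1]
    return -torch.log(sim).mean()                          # high similarity -> low loss

# In the overall generator objective this term would typically be weighted and
# added to the reconstruction and adversarial losses, e.g.
#   loss = l1_loss + lambda_adv * adv_loss + lambda_sync * lip_sync_loss(...)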
Keywords/Search Tags: Speech synthesis, Talking-face video generation, Dataset, Generative adversarial network