Talking face generation is a deep-learning-based application that synthesizes realistic talking face videos from given visual identity information and audio clips, ensuring that the mouth movements are synchronized with the audio. In recent years, advances in deep learning and the construction of large-scale datasets have made image-based talking face generation widely used and increasingly studied. Current research interest in this task is driven largely by the demand for high-quality virtual human characters: in applications such as computer games, virtual reality, and video conferencing, highly realistic virtual faces are crucial to the user experience, yet traditional 3D modeling and animation pipelines typically require substantial manpower and time, making them costly. Talking face generation can instead learn and reproduce the coupling among facial motion, expression, and speech directly from data, and thus produce highly realistic virtual faces automatically. This opens new possibilities for human-machine interaction, medical simulation, virtual education, and other fields.

Nevertheless, talking face generation still faces many challenges, such as keeping speech and facial expressions coordinated during generation and improving the clarity and realism of the generated frames. To address these challenges, deep-learning-based methods have improved steadily alongside advances in neural network architectures, for example through conditional generative models and generative adversarial networks (GANs). These techniques raise the quality and diversity of generated talking faces, making the resulting videos more realistic and lifelike. However, current methods still suffer from inaccurate lip movements, poor image quality, and insufficient emotional synchronization, which seriously degrade the quality and realism of the generated videos and limit their practical use. Inaccurate lip movements produce jarring audiovisual mismatches, poor image quality leads to blurring or distortion, and insufficient emotional synchronization means the generated video fails to convey the intended emotion. Solving these problems requires more advanced and reliable video generation techniques.

This article proposes a new two-stage audio emotion-aware talking face generation framework that produces high-quality, realistic, and emotionally synchronized talking face videos with accurate lip movements and emotional expressions. The framework combines several components, including a cross-modal emotion landmark generation network, a coordinated visual emotional representation, and a feature adaptive transformation module. In the first stage, we propose a sequence-to-sequence cross-modal emotion landmark generation network that produces emotionally synchronized lip movements and facial landmarks. By learning the semantic relationship between audio and image, this network generates lip movements and emotional cues that match the audio, thus ensuring that the generated videos have good emotional synchronization.
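To make the first stage more concrete, the following is a minimal sketch of what such a sequence-to-sequence audio-to-landmark generator could look like in PyTorch. The class name, the layer choices (an LSTM encoder-decoder), and the dimensions are assumptions made for illustration and are not taken from the paper itself.

    # Illustrative sketch only: the layer choices, dimensions, and names are
    # assumptions for exposition, not the exact architecture of the paper.
    import torch
    import torch.nn as nn

    class AudioToLandmarkSeq2Seq(nn.Module):
        """Maps per-frame audio features plus an emotion embedding to a
        sequence of 2D facial landmark offsets."""

        def __init__(self, audio_dim=80, emotion_dim=16, hidden_dim=256, n_landmarks=68):
            super().__init__()
            self.n_landmarks = n_landmarks
            self.encoder = nn.LSTM(audio_dim + emotion_dim, hidden_dim,
                                   num_layers=2, batch_first=True)
            self.decoder = nn.LSTM(hidden_dim, hidden_dim,
                                   num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_landmarks * 2)  # (x, y) per landmark

        def forward(self, audio_feats, emotion_emb):
            # audio_feats: (B, T, audio_dim); emotion_emb: (B, emotion_dim)
            emo = emotion_emb.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
            enc_out, state = self.encoder(torch.cat([audio_feats, emo], dim=-1))
            dec_out, _ = self.decoder(enc_out, state)
            # Per-frame landmark offsets, to be added to a reference landmark
            # set extracted from the identity image.
            offsets = self.head(dec_out)
            return offsets.view(audio_feats.size(0), -1, self.n_landmarks, 2)

In a full pipeline of this kind, the predicted landmark sequence would then condition the second-stage renderer described next.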
At the same time, we propose a coordinated visual emotional representation that strengthens the extraction of audio emotion features so that they fuse better with image features. In the second stage, we design a feature adaptive transformation network that turns the generated emotionally synchronized lip movements and landmarks into high-quality, realistic talking face videos. To improve image quality and reduce inaccurate lip movements, we introduce a feature adaptive transformation module that fuses high-level lip-motion and image features (a minimal sketch is given below). With this design, we effectively reduce distortion and blurring in the generated video while improving its realism and credibility.

We conducted extensive experiments on two large-scale talking face video datasets, MEAD and CREMA-D, to evaluate the proposed model. We compared it with state-of-the-art methods, performed ablation studies on each component of the model, and visualized selected high-dimensional features. Compared with recent work, our model achieves the best performance on multiple metrics and in visual results. In addition, we release the code, data, and parameter settings of our experiments to promote further research on talking face generation.
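As a concrete illustration of the feature adaptive transformation module referenced above, the sketch below modulates image features with a per-channel scale and shift predicted from the lip-motion/landmark features, in the spirit of AdaIN-style conditioning. The class name, shapes, and layer choices are assumptions for illustration, not the paper's exact design.

    # Illustrative sketch only: an AdaIN-style modulation standing in for the
    # feature adaptive transformation module; names, shapes, and layer choices
    # are assumptions rather than the paper's exact design.
    import torch
    import torch.nn as nn

    class FeatureAdaptiveTransform(nn.Module):
        """Fuses high-level lip-motion/landmark features into image features by
        predicting per-channel scale and shift parameters from the motion stream."""

        def __init__(self, img_channels=256, motion_dim=128):
            super().__init__()
            self.norm = nn.InstanceNorm2d(img_channels, affine=False)
            self.to_scale = nn.Linear(motion_dim, img_channels)
            self.to_shift = nn.Linear(motion_dim, img_channels)

        def forward(self, img_feat, motion_feat):
            # img_feat: (B, C, H, W) high-level features of the identity image
            # motion_feat: (B, motion_dim) pooled lip-motion/landmark features
            scale = self.to_scale(motion_feat).unsqueeze(-1).unsqueeze(-1)
            shift = self.to_shift(motion_feat).unsqueeze(-1).unsqueeze(-1)
            # Normalize image features, then re-inject motion-conditioned statistics.
            return self.norm(img_feat) * (1 + scale) + shift

In a generator of this style, several such blocks could be applied at different feature resolutions so that the motion signal steers both coarse mouth shape and fine texture.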