Given a single facial image (or several) and a driving source (e.g., an audio speech signal or a sequence of facial landmarks), the audio-driven facial animation task aims to generate a realistic talking-head video whose dynamics correspond to the driving source. Solving this task enables a wide range of practical applications, such as redubbing videos into other languages, telepresence for video conferencing or role-playing video games, bandwidth-constrained video transmission, and virtual hosts. Another potential application is enhancing speech comprehension for the hearing impaired, since the combination of audio and video conveys the content of a message more effectively than a single source of information.

This thesis studies the existing methods and identifies two shortcomings: the clarity of the generated videos is relatively low, and many methods model only a single facial feature. Modeling a single feature limits realism, because the authenticity of a face is usually judged on multiple features, such as lip consistency, facial expression, and head movement; only when these are jointly plausible can a generated portrait pass as real. Addressing these two problems and the related research, this thesis is organized around the following three parts:

1. Multi-task training that models multiple facial features jointly. In existing work, researchers concentrate on a single facial feature, most commonly lip consistency. However, a face carries many other details: if only the lower part of the face (mainly the lips) is modeled and trained, the generated portrait is stiff, its expression never changes, and its head stays fixed, so the result does not look realistic. Facial expression is a non-negligible part of the face, so the method in this thesis models multiple facial features jointly (mainly the lips and facial expressions), which makes the generated portrait more vivid (see the joint-loss sketch below).

2. High-resolution face portrait generation. In existing methods, the generated portraits have a relatively low resolution, so many facial details are lost and the result looks blurred. This thesis explores high-resolution techniques so that the generated portraits retain more texture and appear sharper.

3. Two generation pipelines. This thesis focuses on the two existing approaches to dynamic face portrait generation: generation based on a 3D Morphable Model (3DMM) and generation based on facial landmarks. In the 3DMM-based pipeline, the speech information is first mapped to 3DMM coefficients, which are then used to render the face image. In the landmark-based pipeline, facial landmarks are predicted from the speech information, and the target face portrait is then generated from the predicted landmarks (see the two-stage pipeline sketch below). Both approaches generate natural, realistic facial portraits with high-quality emotional expression.
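The following is a minimal sketch of the joint multi-feature objective described in part 1, written in PyTorch. The module name MultiFeatureLoss, the loss weights, and the choice of an L1 penalty on a cropped mouth region and on expression parameters are illustrative assumptions, not the thesis's actual implementation; they only show how lip consistency and expression can be supervised in one objective rather than training on lips alone.

    import torch
    import torch.nn as nn

    class MultiFeatureLoss(nn.Module):
        """Jointly supervises lip motion and facial expression instead
        of optimizing lip consistency alone (hypothetical sketch)."""

        def __init__(self, w_lip: float = 1.0, w_expr: float = 0.5):
            super().__init__()
            self.w_lip = w_lip    # weight for the lip-sync term
            self.w_expr = w_expr  # weight for the expression term
            self.l1 = nn.L1Loss()

        def forward(self, pred_lip, gt_lip, pred_expr, gt_expr):
            # pred_lip / gt_lip: cropped mouth-region frames, (B, C, H, W)
            # pred_expr / gt_expr: expression parameters, e.g. 3DMM
            # expression coefficients, (B, D)
            lip_loss = self.l1(pred_lip, gt_lip)
            expr_loss = self.l1(pred_expr, gt_expr)
            return self.w_lip * lip_loss + self.w_expr * expr_loss

Because both terms share the same generator, gradients from the expression term discourage the frozen upper face that single-feature (lip-only) training tends to produce.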
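The sketch below illustrates the two-stage data flow described in part 3: audio features are first mapped to an intermediate representation (3DMM coefficients here; a landmark-based variant would predict 2D landmark coordinates instead), and a second network decodes that representation into frames. The LSTM predictor, the feature dimensions, and the toy 32x32 decoder are assumptions for illustration only, not the thesis's concrete networks.

    import torch
    import torch.nn as nn

    class AudioTo3DMM(nn.Module):
        """Stage 1: predict per-frame 3DMM coefficients from audio
        features (hypothetical recurrent predictor)."""

        def __init__(self, audio_dim: int = 80, coeff_dim: int = 64,
                     hidden: int = 256):
            super().__init__()
            self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, coeff_dim)

        def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
            # audio_feats: (B, T, audio_dim), e.g. mel-spectrogram frames
            h, _ = self.rnn(audio_feats)
            return self.head(h)               # (B, T, coeff_dim)

    class CoeffToFrame(nn.Module):
        """Stage 2: decode one face frame from one coefficient vector
        (a stand-in for the image generator)."""

        def __init__(self, coeff_dim: int = 64):
            super().__init__()
            self.fc = nn.Linear(coeff_dim, 128 * 4 * 4)
            self.up = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16x16
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32x32
                nn.Tanh(),
            )

        def forward(self, coeffs: torch.Tensor) -> torch.Tensor:
            # coeffs: (B, coeff_dim) -> frame: (B, 3, 32, 32)
            x = self.fc(coeffs).view(-1, 128, 4, 4)
            return self.up(x)

    # Usage: 25 audio frames -> 25 coefficient vectors -> 25 video frames.
    audio = torch.randn(1, 25, 80)
    coeffs = AudioTo3DMM()(audio)                    # (1, 25, 64)
    frames = CoeffToFrame()(coeffs.reshape(-1, 64))  # (25, 3, 32, 32)

Splitting the task this way keeps the audio-to-motion mapping separate from photorealistic rendering, which is the shared structure of both the 3DMM-based and the landmark-based pipelines studied in this thesis.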