Motion transfer algorithms are widely used in many domains, including video effects, virtual conferencing, and fake-video generation. Classical motion transfer pipelines are complicated, requiring professional technicians and expensive equipment. Benefiting from the development of deep learning, a growing body of research tackles this task with deep models to reduce the cost of existing methods and improve their usability. Warp-based and image-generation methods predict accurate warp fields only with manually annotated data, which is expensive to obtain. Other shortcomings of warp-based methods include inaccurate motion estimation, unnatural generated results, and low inference speed. The main problem of disentangling-based methods is that current models cannot fully disentangle images, which leads to poor results.

In this paper, we propose two motion transfer algorithms based on self-supervised learning: one warp-based and one disentangling-based.

(1) Self-supervised motion transfer based on image affine transformations. Since current methods cannot represent accurate motion information, we use multiple affine transformation matrices to represent the motion of objects between image pairs. Simple affine matrices encode motion effectively: our experiments show that they capture the pixel displacements caused by both subtle and large-scale motion changes. As a result, our model captures accurate motion information and generates clear, natural images. We also design a minimal structure for the regression net that predicts the affine parameters; the small parameter count gives our model faster inference.

(2) Self-supervised motion transfer based on image disentangling. Unlike current methods, which construct image pairs by warping images manually, our model draws training pairs from a video dataset and trains the whole model with self-supervised learning. Learning from the real motion distribution lets our model handle more complex inputs. We represent pose information with masks: the pose encoder first predicts landmarks from the input image and then predicts masks from the landmarks, so the generated masks need not follow a Gaussian distribution. This lets the model produce more complicated and accurate masks; it also keeps the masks independent of appearance and forces the model to disentangle the image. Experiments show that the landmark detector automatically converges to a reasonable state during self-supervised training and localizes faces in the image correctly. We also present comparison results showing that the masks generated by the pose encoder carry semantic features.
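To make the role of the affine matrices in method (1) concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of how a single 2x3 affine matrix maps pixel coordinates in a source frame to their displaced positions in a driving frame, covering rotation, scale, shear, and translation in one compact parameterization; the matrix values below are arbitrary assumptions standing in for what the regression net would predict.

```python
import numpy as np

def affine_warp_points(points, A):
    """Apply a 2x3 affine matrix A to an (N, 2) array of (x, y) points."""
    # Append a homogeneous coordinate so translation is a matrix product too.
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 3)
    return homo @ A.T  # (N, 2) warped coordinates

# Example: a 5-degree rotation plus a small translation, standing in for
# the "subtle motion change" a predicted affine matrix would encode.
theta = np.deg2rad(5.0)
A = np.array([[np.cos(theta), -np.sin(theta),  2.0],
              [np.sin(theta),  np.cos(theta), -1.0]])

pts = np.array([[10.0, 10.0], [50.0, 20.0]])
warped = affine_warp_points(pts, A)
```

Because each matrix has only six parameters, a set of such matrices is a very cheap motion representation, which is consistent with the fast inference the abstract claims for the minimal regression net.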
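The two-stage pose encoder in method (2) (landmarks first, masks second) can be sketched as follows. This is a hypothetical illustration under our own assumptions: the function name, shapes, and the distance-based softmax standing in for the learned mask decoder are not from the paper. The point it demonstrates is structural: because the masks are decoded from landmark coordinates by a separate stage, nothing forces them into a Gaussian shape.

```python
import numpy as np

def landmarks_to_masks(landmarks, h, w, temperature=0.01):
    """Turn (K, 2) landmarks in [-1, 1] coords into (K, h, w) soft masks.

    A softmax over negative squared distance stands in here for the
    learned mask decoder; a real decoder is free to produce masks of
    any shape, which is why no Gaussian prior needs to be imposed.
    """
    ys = np.linspace(-1, 1, h)[:, None]  # (h, 1) grid rows
    xs = np.linspace(-1, 1, w)[None, :]  # (1, w) grid cols
    masks = []
    for (lx, ly) in landmarks:
        d2 = (xs - lx) ** 2 + (ys - ly) ** 2
        logits = -d2 / temperature
        m = np.exp(logits - logits.max())  # numerically stable softmax
        masks.append(m / m.sum())          # each mask sums to 1
    return np.stack(masks)

# Two hypothetical landmarks: image center and upper-right region.
masks = landmarks_to_masks(np.array([[0.0, 0.0], [0.5, -0.5]]), 64, 64)
```

In the abstract's pipeline the landmark regressor and the mask decoder are both trained end to end with the self-supervised reconstruction objective, so the masks also stay independent of appearance features.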