The advent of the 5G era will further promote the vigorous development of short video, and human-centered short video creation has entered everyday life. Video instance-level human parsing obtains segmentation information for each body part in a video, which makes short videos more convenient to create. There are two existing approaches to this task. The first processes the video frame by frame; it is simple and effective but cannot handle motion blur. The second takes multi-frame input, but is slower and requires a larger model. In this thesis, we propose the Multi-Frame Propagation Net (MFPNet) to address the shortcomings of both approaches: spatial features from the current frame and temporal features from the previous k frames are unified in MFPNet. The main contributions are as follows.

First, we propose two modules: Position-Squeeze-and-Excitation (P-SE) and the Global Attention Module (GAM). P-SE applies the idea of Squeeze-and-Excitation (SE) to spatial locations; it learns a spatial attention map that represents the degree of correlation between human parts. GAM combines SE and P-SE to extract global structured features. Building on these modules, the Human Parsing Attention Net (HPANet), which operates on the current frame, is proposed.

Second, a propagation module is proposed to capture temporal features across video frames. This module consists of 3D convolution and a Convolutional Gated Recurrent Unit (ConvGRU): the 3D convolution extracts spatiotemporal features between consecutive frames, and the ConvGRU further refines the temporal features. MFPNet is composed of HPANet and the propagation module. Compared with other mainstream methods, MFPNet achieves better performance on the Video Instance-level Parsing (VIP) dataset.
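The contrast between SE (squeeze over spatial positions, excite channels) and P-SE (squeeze over channels, excite spatial positions) can be sketched in a few lines. This is a minimal NumPy illustration, not the thesis's exact design: the two-layer bottleneck, the ReLU/sigmoid choices, and the reduction ratios are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Channel attention (SE): squeeze over spatial dims, excite channels."""
    z = x.mean(axis=(1, 2))                   # squeeze: (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ z, 0))   # excitation: (C,) in (0, 1)
    return x * a[:, None, None]               # rescale each channel

def pse_block(x, w1, w2):
    """Spatial attention (P-SE, assumed form): squeeze over channels,
    excite spatial positions with the same bottleneck pattern."""
    c, h, w = x.shape
    z = x.mean(axis=0).reshape(-1)            # squeeze: (H*W,)
    a = sigmoid(w2 @ np.maximum(w1 @ z, 0))   # excitation: (H*W,) in (0, 1)
    return x * a.reshape(1, h, w)             # rescale each spatial position

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))            # toy feature map: 8 channels, 4x4
# SE bottleneck over the 8 channels (assumed reduction ratio 2)
se_w1, se_w2 = rng.standard_normal((4, 8)), rng.standard_normal((8, 4))
# P-SE bottleneck over the 16 spatial positions (assumed reduction ratio 2)
ps_w1, ps_w2 = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))

y_se = se_block(x, se_w1, se_w2)              # channel-reweighted features
y_pse = pse_block(x, ps_w1, ps_w2)            # spatially reweighted features
```

Because both attention maps pass through a sigmoid, each output is an element-wise damped copy of the input; GAM, as described above, would combine the two reweighting paths.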