| Video frame prediction is a pixel-level prediction task based on deep learning algorithm.It predicts and generates future frames by observing a series of input frames.Due to the need to solve the problem of complex texture appearance and time-varying motion of video frames,the task becomes very complicated.However,because the video frame prediction task can solve the problem of scene and action prediction,the video frame prediction model has a broad application prospect in robot system,automatic driving and video surveillance.According to the number of video frames predicted,the video frame prediction task is divided into single frame prediction task and multi-frame prediction task.In this paper,a semantic fusion two-stream prediction model is designed to accomplish these two tasks.The model includes one semantic auxiliary path and two predictive paths.The input of semantic assistance path is semantic maps,which firstly extracts semantic features through the CNNs,and then explores the spatio-temporal relationship between different semantic features through the GCNs.Finally,this feature information is used to help prediction.The semantic path introduces the scene information in the image to the model,thus helping the model to better understand the scene.The input of the two prediction paths of the model are optical flow maps and original images respectively.The CNN is built to encode optical flow and RGB images to capture the spatial-temporal features contained in the video frames,and then the Conv LSTM and convolutional decoder are used to predict future video frames.Finally,the fusion module is used to fuse the output video frames of the two prediction paths.This two-stage approach helps to produce better predictions.The multi-frame prediction model makes subtle adjustments to the single-frame prediction model,and proposes a Conv LSTM network with attention mechanism for prediction,so that the input can be modeled for a longer time.In this paper,single-frame prediction and multi-frame prediction experiment were carried out on Caltech traffic dataset,and two parts of comparison experiment were set up,namely the ablation experiment and the comparison experiment with previous algorithms.The results of the ablation experiment show that the semantic fusion strategy and the dual-path fusion strategy in the model can effectively improve the quality of image prediction.In comparison with previous algorithms,the model in this paper can predict better and sharp images,and exceeds most of the model algorithms in the evaluation metrics of PSNR,SSIM and LPIPS. |