With the rapid development of network technology and communication technology,many fields such as weather forecasting,traffic flow forecasting,human movement prediction,and autonomous driving are facing a sharp increase in video data.As a kind of multimedia data,video not only contains the spatial information of the image,but also the dynamic information of the time dimension.Video prediction is a self-supervised way to predict future frames under the condition of given consecutive frames,without manual labeling,which greatly reduces heavy manual labeling work and a lot of time consumption.The existing video prediction model cannot fully extract and fuse the longterm dependent information of continuous video frames and complex spatiotemporal information,so it cannot effectively predict the motion state of objects,resulting in blurring and artifacts in the prediction results.In view of the above problems,the main research content of thesis is as follows:Firstly,aiming at the problem that the video prediction model cannot fully extract and fuse the long-term dependence information of continuous video frames,thesis first studies the current improved classical video prediction model based on long-term dependency information and analyzes its advantages and disadvantages,and then proposes a TR-LSTM model,which consists of two parts:(1)In order to alleviate the uneven overlap phenomenon of transposed convolution in the decoding structure when expanding the resolution of the input feature map,we introduce subpixel convolution and propose a subpixel convolution decoding network;(2)In order to extract and fuse the long-term dependent information of video frames more fully,TR-LSTM,a time recall long short-term memory structure,is designed,which realizes the product of the temporal information state tensor of the historical frame and the adaptive weight tensor corresponding to it,and fuses it through the spatial channel attention mechanism.On the Moving MNIST benchmark dataset,the MSE index of the TR-LSTM model is reduced by 3.0,and the SSIM index is increased by 0.009.On the KTH benchmark data set,the SSIM and PSNR indicators of the TR-LSTM model increased by 0.02 and 3.77 respectively in the case of 20 frames after prediction,and increased by 0.07 and 4.91 in the case of 40 frames after prediction.In addition,ablation experiments on the sub-pixel convolutional decoding network are also performed on these two benchmark datasets,and it is verified that the proposed decoding network can alleviate the uneven overlapping of upsampled images.Secondly,aiming at the problem that the video prediction model cannot fully integrate the complex spatiotemporal information,the current improved classical video prediction model based on spatiotemporal information fusion is studied,and its advantages and disadvantages are analyzed,and then the STSM model is proposed,which consists of two parts:(1)In order to fully integrate the spatiotemporal information of video frames,STSMCell is proposed.This model uses large convolution kernels to effectively extract the spatiotemporal information of adjacent frames and uses adaptive weighted tensors to fully fuse the spatiotemporal information.(2)In order to supplement the problem of spatial fine-grained information loss in the encoding process,a short-term information recall mechanism SIR is proposed,which uses adaptive weight tensors to fuse the background space detail information of adjacent frames to make the predicted video frames clearer.On the two benchmark data sets of Moving MNIST and Traffic BJ,the MSE indicators of the STSM model were reduced by 1.8 and 1.3,respectively,and the MAE indicators were reduced by 3.7 and 0.2.On the KTH benchmark data set,the SSIM and PSNR indicators of the STSM model increased by 0.02 and 0.15 in the case of20 frames after prediction,and increased by 0.003 and 0.23 in the case of 40 frames after prediction.In addition,ablation experiments on SIR are performed on three benchmark datasets,verifying the supplementary role of this module to spatial fine-grained information. |