With the rapid development of the mobile internet, the volume of video data is growing exponentially. Modeling massive video data and understanding its spatiotemporal structure without additional manual annotation has high research and practical value for scenarios such as weather forecasting, autonomous driving, and intelligent transportation systems. The core task of video prediction is to produce fine-grained forecasts of future trends by learning the mapping between historical and future data. Most current models, however, do not sufficiently extract the temporal information between video frames and pay too little attention to the moving foreground, which leads to blurry predictions and artifacts. In this thesis, optical flow features are incorporated into convolutional long short-term memory (ConvLSTM) networks to achieve better video prediction results. The main research work is as follows:

(1) To address the insufficient extraction of motion information in video prediction networks, a new encoder-decoder network based on ConvLSTM is proposed. Skip connections and an optical flow module are added to supplement the motion details of the images, yielding a video prediction model based on optical flow features, called FED-ConvLSTM. Experiments confirm that cascading the RAFT optical flow module with the FED-ConvLSTM network achieves good prediction results on three public datasets, Moving MNIST, KTH, and Human3.6M, with structural similarity (SSIM) scores of 0.928, 0.892, and 0.878, respectively.

(2) To address the insufficient attention that video prediction networks pay to the moving foreground, this thesis proposes an optical flow pyramid video prediction model based on foreground attention, called PFED-ConvLSTM. By constructing a Gaussian pyramid over the optical flow images, the proposed PFED-ConvLSTM model captures global information at large scales as well as image details at small scales. In addition, a foreground attention mechanism is designed, in which pooling operations strengthen the foreground region and suppress the background, while dilated convolutions enlarge the receptive field. The foreground attention module allocates attention resources across both the spatial and channel dimensions. The SSIM of PFED-ConvLSTM on Moving MNIST, KTH, and Human3.6M reaches 0.933, 0.903, and 0.890, respectively, further improving prediction performance.

(3) Extensive comparative experiments against existing classical video prediction models confirm that the proposed models achieve better prediction quality than those classical models and are more competitive overall.
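The Gaussian pyramid over optical flow images described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: the 5-tap Burt-Adelson blur kernel, the number of levels, and the helper names are all assumptions. Flow vectors are halved at each level so they remain in the pixel units of that level.

```python
import numpy as np

def gaussian_blur(img):
    """Separable 5-tap Gaussian blur (Burt-Adelson weights 1 4 6 4 1 / 16)."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    out = img.astype(np.float64)
    for axis in (0, 1):
        # Reflect-pad along this axis, then accumulate the weighted shifts.
        padded = np.pad(out, [(2, 2) if a == axis else (0, 0)
                              for a in range(out.ndim)], mode="reflect")
        n = out.shape[axis]
        acc = np.zeros_like(out)
        for i, w in enumerate(k):
            acc += w * np.take(padded, np.arange(i, i + n), axis=axis)
        out = acc
    return out

def flow_pyramid(flow, levels=3):
    """Gaussian pyramid for a 2-channel optical flow field of shape (H, W, 2).

    Coarse levels expose global motion; fine levels keep image detail.
    Vectors are rescaled by 0.5 per level to stay in pixel units.
    """
    pyramid = [flow.astype(np.float64)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        blurred = np.stack([gaussian_blur(prev[..., c]) for c in range(2)],
                           axis=-1)
        pyramid.append(blurred[::2, ::2] * 0.5)  # downsample + rescale vectors
    return pyramid
```

A uniform flow field stays uniform under the blur (the kernel sums to one), which is a quick sanity check that only the vector magnitudes, not their directions, change across levels.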
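The foreground attention mechanism above combines pooling with dilated convolution across the channel and spatial dimensions. The NumPy sketch below assumes a CBAM-style split into a channel branch (global average- and max-pooling followed by a shared linear layer) and a spatial branch (channel-wise pooling followed by a 3x3 dilated convolution); the layer shapes, the single linear layer `w`, and the kernel are illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w):
    """Channel branch on a (C, H, W) feature map: global average- and
    max-pooling squeeze the spatial dims, a shared linear layer w (C x C)
    scores each channel, and a sigmoid gate rescales the channels."""
    avg = feat.mean(axis=(1, 2))          # (C,)
    mx = feat.max(axis=(1, 2))            # (C,)
    gate = sigmoid(w @ avg + w @ mx)      # (C,), values in (0, 1)
    return feat * gate[:, None, None]

def spatial_attention(feat, kernel, dilation=2):
    """Spatial branch: pool across channels, then apply a 3x3 *dilated*
    convolution (zero padding) to enlarge the receptive field before the
    sigmoid gate strengthens foreground pixels and suppresses background."""
    pooled = feat.mean(axis=0) + feat.max(axis=0)      # (H, W)
    pad = dilation                                     # 3x3 kernel, pad = d
    p = np.pad(pooled, pad, mode="constant")
    h, w_ = pooled.shape
    score = np.zeros_like(pooled)
    for i in range(3):                                 # taps at offsets
        for j in range(3):                             # {-d, 0, +d}
            di, dj = i * dilation, j * dilation
            score += kernel[i, j] * p[di:di + h, dj:dj + w_]
    return feat * sigmoid(score)[None, :, :]
```

Because both gates lie in (0, 1), the module can only reweight activations, never amplify them, which is what lets it suppress background responses while preserving the foreground.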