Research On Video Prediction Based On Spatiotemporal Information Fusion

Posted on:2024-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:L Tang

Full Text:PDF

GTID:2568307106486294

Subject:Applied statistics

Abstract/Summary:

With the rapid development of network and communication technology, fields such as weather forecasting, traffic flow forecasting, human motion prediction, and autonomous driving are seeing a sharp increase in video data. As a form of multimedia data, video contains not only the spatial information of each image but also dynamic information along the time dimension. Video prediction is a self-supervised task: given a sequence of consecutive frames, the model predicts future frames without manual labels, which removes heavy and time-consuming annotation work. Existing video prediction models cannot fully extract and fuse the long-term dependency information of consecutive video frames or their complex spatiotemporal information, so they cannot effectively predict the motion state of objects, and their predictions show blurring and artifacts. To address these problems, the main research content of this thesis is as follows.

First, to address the problem that video prediction models cannot fully extract and fuse the long-term dependency information of consecutive video frames, this thesis reviews existing improved versions of classical video prediction models based on long-term dependency information, analyzes their advantages and disadvantages, and then proposes the TR-LSTM model, which consists of two parts: (1) To alleviate the uneven overlap that transposed convolution produces in the decoding structure when it enlarges the resolution of the input feature map, subpixel convolution is introduced and a subpixel convolution decoding network is proposed. (2) To extract and fuse the long-term dependency information of video frames more fully, TR-LSTM, a time recall long short-term memory structure, is designed; it multiplies the temporal information state tensor of each historical frame by its corresponding adaptive weight tensor and fuses the results through a spatial channel attention mechanism. On the Moving MNIST benchmark dataset, the MSE of the TR-LSTM model is reduced by 3.0 and the SSIM is increased by 0.009. On the KTH benchmark dataset, the SSIM and PSNR of the TR-LSTM model increase by 0.02 and 3.77, respectively, when predicting the next 20 frames, and by 0.07 and 4.91 when predicting the next 40 frames. In addition, ablation experiments on the subpixel convolution decoding network are performed on these two benchmark datasets, verifying that the proposed decoding network alleviates the uneven overlap in upsampled images.

Second, to address the problem that video prediction models cannot fully integrate complex spatiotemporal information, this thesis reviews existing improved versions of classical video prediction models based on spatiotemporal information fusion, analyzes their advantages and disadvantages, and then proposes the STSM model, which consists of two parts: (1) To fully integrate the spatiotemporal information of video frames, STSMCell is proposed. It uses large convolution kernels to extract the spatiotemporal information of adjacent frames and adaptive weight tensors to fuse that information. (2) To compensate for the loss of fine-grained spatial information during encoding, a short-term information recall mechanism, SIR, is proposed. It uses adaptive weight tensors to fuse the background spatial detail of adjacent frames so that the predicted video frames are clearer. On the Moving MNIST and Traffic BJ benchmark datasets, the MSE of the STSM model is reduced by 1.8 and 1.3, respectively, and the MAE is reduced by 3.7 and 0.2. On the KTH benchmark dataset, the SSIM and PSNR of the STSM model increase by 0.02 and 0.15 when predicting the next 20 frames, and by 0.003 and 0.23 when predicting the next 40 frames. In addition, ablation experiments on SIR are performed on the three benchmark datasets, verifying that this module supplements fine-grained spatial information.
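The subpixel convolution mentioned in the abstract is commonly implemented as an ordinary convolution that produces C*r*r channels, followed by a fixed rearrangement of those channels into an image r times larger in each direction. Because every output pixel is copied from exactly one input value, no output location receives more overlapping contributions than another, which is the usual argument for why this avoids the uneven overlap of transposed convolution. The sketch below shows only the rearrangement step in plain Python. The function name `pixel_shuffle`, the channel-first nested-list layout, and the channel ordering are assumptions chosen for illustration; the thesis abstract does not describe the decoder at this level of detail.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Illustrative only: the layout and channel ordering are assumptions,
    not taken from the thesis.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])

    # Allocate the upsampled output.
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]

    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Output pixel (i, j) is read from input location
                # (i // r, j // r); the sub-pixel offset (i % r, j % r)
                # selects which of the r*r channels supplies the value.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1x1 channels become one 2x2 image when r = 2.
features = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(features, 2))  # [[[1, 2], [3, 4]]]
```

In a real decoder the input channels would come from a learned convolution; here they are hard-coded so the index mapping is easy to follow.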
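The results above are reported as changes in MSE, MAE, SSIM, and PSNR. For readers unfamiliar with these metrics, the following is a minimal sketch of per-pixel MSE and the PSNR derived from it for a single grayscale frame. The helper names `mse` and `psnr`, the nested-list frame format, and the default peak value of 255 are assumptions made for this example; the abstract does not state how the thesis normalizes its metrics (some video prediction work reports MSE summed over a frame rather than averaged), so the numbers produced here are not directly comparable to the reported ones.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized 2-D frames."""
    diffs = [
        (p - t) ** 2
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    ]
    return sum(diffs) / len(diffs)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed pixel peak."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / err)


predicted = [[10, 20], [30, 40]]
ground_truth = [[12, 20], [30, 44]]
print(mse(predicted, ground_truth))   # 5.0
print(psnr(predicted, ground_truth))
```

Lower MSE and higher PSNR both indicate that the predicted frame is closer to the ground truth, which is why the abstract describes MSE as "reduced" and PSNR as "increased".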

Keywords/Search Tags:

Video Prediction, Long-term Dependency Information, Spatiotemporal Information Fusion, Subpixel Convolution, Information Recall Mechanism

Related items

1	Double Interactive Behavior Recognition Based On RGB And Depth Information Fusion
2	Research On Long-term Target Tracking Algorithm Based On Spatiotemporal Feature Enhancement
3	Research And Implementation Of Video Action Recognition Based On Long-Time Feature Fusion And Attention Mechanism
4	Research On Long-Term Video Prediction Using Taylor Disentanglement
5	Chinese Information Retrieval Based On Term Dependency Information
6	Research On Information Security Management Mechanism In The Long-term Preservation Of Network Information Resources
7	Information Fusion Algorithms For Multisensor System With Spatiotemporal Bias
8	Research On Audio-Video Information Processing Based On Lip-Changing
9	Video Action Detection Based On Temporal Analysis
10	Study On The Mechanism For Long-term Preservation Of Network Information Resource