In recent years, with continuous breakthroughs and innovation in computer vision, video prediction has become a hot direction in deep learning research across many vision tasks: it requires no manual labeling, and large amounts of video data are available in daily life. Video prediction takes given image information and predicts future video frames by building an internal representation model that accurately captures the video content and its dynamics. The goal is for the model to generate future frames automatically by learning from a sequence of previous frames, enabling advance decision making in scenarios such as robotics, autonomous driving, and UAVs. However, compared with images, video contains not only spatial dependence but also temporal dependence, which makes the video prediction task very challenging.

Most current video prediction methods extract temporal or spatial features insufficiently, so their long-term predictions suffer from unclear future frames, blurred imaging, and a lack of local detail. This paper therefore studies video frame prediction algorithms based on Long Short-Term Memory networks (LSTM). Dense convolution is added inside the Convolutional Long Short-Term Memory network (ConvLSTM) to better extract spatial features, and non-local blocks are introduced outside the recurrent module to achieve better prediction results and improve generation quality.

The specific research contents and work are as follows. For the video prediction task, this paper first studies and analyzes the current mainstream prediction algorithms. It then discusses typical LSTM-based video frame prediction algorithms in depth, analyzing and comparing their performance and remaining problems. On this basis, the internal structure of the ConvLSTM network is improved to obtain the Dense Convolutional Long Short-Term Memory network (Den-ConvLSTM). The main contributions are as follows:
1) To address the insufficient spatial feature extraction and blurred imaging common to most LSTM-based video frame prediction algorithms, the internal structure of the ConvLSTM network is improved by replacing the original convolution layer with a dense convolution layer, so that features are reused and the imaging quality of the model improves (sketched below).
2) Because a convolution covers only the region adjacent to a pixel and cannot capture features from distant regions, feature information is lost; to address this, non-local blocks are introduced into the prediction model (also sketched below).
3) The Den-ConvLSTM module and the non-local block module perform spatio-temporal modeling of the encoded feature maps, learning the spatio-temporal information between the feature maps to obtain the final feature map.

This paper evaluates the improved algorithm and current typical video prediction algorithms on the Moving-MNIST synthetic dataset, the KTH action recognition dataset, and the KITTI autonomous driving dataset, using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) as evaluation criteria; for both metrics, larger values indicate better prediction quality.
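To make the first contribution concrete, the following is a minimal sketch of a ConvLSTM cell whose gate computation is preceded by a small dense convolution block. It is written in PyTorch; the growth rate, number of dense layers, and kernel sizes are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Small dense convolution block: each layer sees the concatenation of all
    previous feature maps, so features are reused. growth_rate and num_layers
    are illustrative values, not the thesis's settings."""
    def __init__(self, in_channels, growth_rate=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class DenConvLSTMCell(nn.Module):
    """ConvLSTM cell in which the single gate convolution of the original
    ConvLSTM is preceded by a dense block over [input, hidden state]."""
    def __init__(self, input_channels, hidden_channels):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.dense = DenseBlock(input_channels + hidden_channels)
        # One convolution produces the four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(self.dense.out_channels, 4 * hidden_channels,
                               kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        z = self.dense(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(self.gates(z), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c
```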
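Likewise, the non-local block of the second contribution can be sketched as a standard embedded-Gaussian non-local block with a residual connection; the specific variant and its placement are assumptions, since the abstract only states that non-local blocks are introduced into the prediction model.

```python
import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block: every spatial position attends to
    every other position, so long-range dependencies that a local convolution
    cannot reach contribute to the output feature map."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```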
The experimental results show that, under the same experimental settings, the Den-ConvLSTM video frame prediction algorithm proposed in this paper achieves higher PSNR and SSIM on the Moving-MNIST dataset than the original ConvLSTM network: the average PSNR is 19.52, better than the 19.16 of the original network and even the 19.44 of the PredRNN algorithm, while the SSIM reaches 0.8812, compared with 0.8870 for the original network. At the same time, the video data in the other two datasets can be predicted relatively accurately. Compared with the ConvLSTM algorithm, the proposed algorithm has a clear advantage in prediction ability, and its imaging quality is also greatly improved.
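For reference, the PSNR and SSIM scores reported above can be computed per frame and averaged over the predicted sequence. The sketch below uses scikit-image's implementations and assumes grayscale frames scaled to [0, 1]; this is an assumption for illustration, not the evaluation code used in the thesis.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_prediction(pred_frames, true_frames):
    """Average PSNR and SSIM over a predicted sequence.
    Both arrays are assumed to have shape (T, H, W) with values in [0, 1];
    higher PSNR/SSIM means a prediction closer to the ground truth."""
    psnrs, ssims = [], []
    for pred, true in zip(pred_frames, true_frames):
        psnrs.append(peak_signal_noise_ratio(true, pred, data_range=1.0))
        ssims.append(structural_similarity(true, pred, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```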