Deep learning plays an important role in many fields, such as object detection and speech recognition. However, while existing deep neural networks perform well on image generation and classification tasks, they struggle to meet the needs of modern industrial applications in video stream prediction. Neural network models based on two-dimensional convolution cannot model optical flow in the temporal domain, and the Convolutional Long Short-Term Memory (ConvLSTM) network, which combines two-dimensional convolution with long short-term memory networks, demands high computing power. In this thesis we propose a Generative Adversarial Network (GAN) that combines two-dimensional and three-dimensional convolution to synthesize future video frames step by step, yielding a step-by-step synthesis model that can predict up to 32 frames ahead. The main contributions are as follows.

(1) A new neural network model, the Accompanying Convolutional Network, is proposed to improve the convergence speed of video future-frame prediction models. The accompanying convolutional neural network consists of three parts: a deep model composed of multilayer, multiscale shallow neural networks; selective transmission; and selective multiscale optimization. First, a pre-trained model is obtained; then the dataset is fed into two identical models, the output of each model is recorded at every stage, and selective transmission is performed by comparing PSNR values against the pre-trained model. Experimental results show that the accompanying convolutional neural network effectively accelerates model convergence under PSNR and recovers more image detail than pix2pixHD in comparative experiments.

(2) To address the high computing power required to train deep neural networks, we propose a generative adversarial network that combines 3D and 2D convolution for future video frame prediction: a 3D convolution-based network extracts the optical flow of the video frames along the temporal dimension, while a 2D convolution-based network performs image translation on the predicted frames. The model fuses the edge detection map with the semantic segmentation map, feeds the fused images into a fused-image prediction generator built on 3D convolution, and then performs image translation to obtain realistic video frames.

(3) To further improve the quality of the generated video frames and to evaluate the generated video stream accurately, a new loss function and a new evaluation metric are proposed in this thesis. The loss function combines cross-entropy loss, L1 regularization loss, and Structural Similarity Index (SSIM) loss. The evaluation metric extends the Fréchet Inception Distance (FID). In tests across different scenarios, the proposed model predicts the trend of future video frames accurately and clearly, and the proposed loss function and model structure effectively constrain the quality of the generated frames.
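As a minimal sketch of how such a composite objective might be assembled: the snippet below combines a binary cross-entropy (adversarial) term, an L1 reconstruction term, and an SSIM term. The term weights and the global (non-windowed) SSIM computation are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy, e.g. the adversarial term on the discriminator output."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def l1_loss(fake, real):
    """Mean absolute error between the generated and ground-truth frame."""
    return float(np.mean(np.abs(fake - real)))

def ssim_loss(fake, real, c1=0.01**2, c2=0.03**2):
    """1 - SSIM, computed globally over the whole frame (a simplification of
    the usual windowed SSIM), so minimizing it rewards structural similarity."""
    mu_x, mu_y = fake.mean(), real.mean()
    var_x, var_y = fake.var(), real.var()
    cov = ((fake - mu_x) * (real - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))
    return float(1.0 - ssim)

def combined_loss(fake, real, d_out, lambda_l1=10.0, lambda_ssim=1.0):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    adv = bce_loss(d_out, np.ones_like(d_out))  # generator wants D to output 1
    return adv + lambda_l1 * l1_loss(fake, real) + lambda_ssim * ssim_loss(fake, real)
```

For a perfectly reconstructed frame both the L1 and SSIM terms vanish, so the loss reduces to the adversarial term alone, which is the intended behavior of a reconstruction-regularized GAN objective.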