In today’s era of big data, smart products such as mobile phones and computers have become indispensable in daily life; people use them to learn, to be entertained, and to understand the world. Against this background, image and video data have grown rapidly, so using computers to search and exploit images and videos quickly and efficiently has become crucial. For a human it is extremely simple to describe an image or a video, but processing large amounts of image and video data manually is impractical. How to make computers describe images and videos in natural language, as humans do, has therefore attracted wide interest. Image and video description research combines key technologies from natural language processing and computer vision and has broad application prospects: descriptions can help people quickly retrieve the information they need, support human-computer interaction, and help visually impaired people understand the content of images and videos. Early image and video description methods were mainly based on retrieval and templates; they are simple, but their results are unsatisfactory. Owing to the great success of deep learning in computer vision and natural language processing, more and more researchers now study description methods based on deep learning. In recent years, such methods typically use convolutional neural networks to extract image or video features and recurrent neural networks to generate natural description sentences, i.e., an encoder-decoder structure. This paper studies image and video description methods based on deep learning.

First, this paper analyzes the weak correlation between image features and description sentences in image description, and designs a new attention mechanism that better correlates image features with word features. The image description model uses two encoders, VGG19 and ResNet101, and a Long Short-Term Memory (LSTM) network with the proposed attention mechanism as the decoder. Experiments on a public image description dataset demonstrate the feasibility and effectiveness of the new attention mechanism and verify the superior performance of the attention-based image description model designed in this paper.

Second, this paper extends from image description to video description. Because videos are more complex and diverse, video description is considerably more difficult than image description. The video description framework adopts the deep convolutional neural network Inception-v4 as the encoder and an LSTM network as the decoder to generate natural description sentences for videos. The attention mechanism designed for image description is applied to video description and further improved into an attention mechanism based on dilated convolution, which enlarges the receptive field without increasing the number of model parameters and thus better correlates video-frame information with sentence information. Experiments on a public video description dataset prove the effectiveness of the dilated-convolution attention model, both through improved evaluation-metric scores and through the effective natural description sentences it generates for videos.

Finally, the preceding video description model considers only the forward flow from video to natural description sentences and does not exploit information from sentence back to video. To keep the generated description consistent with the content of the input video, a reconstruction mechanism is introduced into the video description model. The reconstruction mechanism uses the sentences generated by the decoder to reproduce the video-frame features, and then further optimizes the model by comparing the reconstructed features with the features originally extracted by the encoder. Convolutional neural networks of different depths are used as encoders, and extensive experiments are carried out on the datasets. The experimental results show that the video description model with the reconstruction mechanism outperforms most mainstream methods.
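The abstract does not spell out the attention equations. As an illustration only, here is a minimal NumPy sketch of a standard additive (Bahdanau-style) attention step over encoder features, the kind of mechanism the decoder described above builds on; all shapes, weight matrices, and names are hypothetical, not taken from the thesis:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(features, hidden, Wf, Wh, v):
    """One attention step: score each encoder feature vector against the
    decoder hidden state, then return the weighted context vector.

    features: (n, d_f) encoder feature vectors (e.g. image regions or frames)
    hidden:   (d_h,)   current LSTM hidden state
    Wf, Wh, v: learned projections (random here, for illustration only)
    """
    # alignment score per feature: e_i = v^T tanh(Wf f_i + Wh h)
    scores = np.tanh(features @ Wf + hidden @ Wh) @ v   # (n,)
    alpha = softmax(scores)                             # attention weights, sum to 1
    context = alpha @ features                          # (d_f,) weighted sum of features
    return context, alpha

# toy dimensions, chosen arbitrarily
rng = np.random.default_rng(0)
n, d_f, d_h, d_a = 5, 8, 6, 4
feats = rng.standard_normal((n, d_f))
h = rng.standard_normal(d_h)
Wf = rng.standard_normal((d_f, d_a))
Wh = rng.standard_normal((d_h, d_a))
v = rng.standard_normal(d_a)

ctx, alpha = additive_attention(feats, h, Wf, Wh, v)
```

The context vector `ctx` would be fed to the LSTM at each decoding step, so that each generated word attends to the most relevant encoder features.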
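The claim that dilated convolution enlarges the receptive field without adding parameters can be illustrated with a toy one-dimensional example (a NumPy sketch; the kernel, dilation rates, and input signal are hypothetical and unrelated to the thesis model):

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D convolution (CNN-style correlation) with a dilated kernel.
    The kernel has the same number of weights regardless of dilation; only
    the spacing between taps, and hence the receptive field, changes."""
    k = len(w)
    span = (k - 1) * dilation + 1   # receptive field of one output value
    out = np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])
    return out, span

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])                # 3 parameters in both cases
out1, rf1 = dilated_conv1d(x, w, dilation=1) # receptive field 3
out2, rf2 = dilated_conv1d(x, w, dilation=2) # receptive field 5
```

With dilation 2, each output value covers five input positions instead of three, yet the kernel still holds only three weights; this is the trade-off the dilated-convolution attention mechanism exploits.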
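A minimal sketch of how such a reconstruction term might enter the training objective, assuming a simple mean-squared-error comparison between original and reconstructed frame features and a hypothetical weight `lam` (the thesis does not specify the exact loss form here):

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """Mean squared error between the frame features originally extracted
    by the encoder and the features reproduced from the generated sentence."""
    return float(np.mean((original - reconstructed) ** 2))

def total_loss(caption_nll, original, reconstructed, lam=0.2):
    """Joint objective: forward captioning loss plus a weighted
    reconstruction term (lam is a hypothetical trade-off weight)."""
    return caption_nll + lam * reconstruction_loss(original, reconstructed)

# toy features: 4 frames, 3-dimensional each
orig = np.ones((4, 3))
recon = np.full((4, 3), 0.5)   # an imperfect reconstruction
loss = total_loss(caption_nll=2.0, original=orig, reconstructed=recon)
```

Minimizing the joint loss pushes the decoder to generate sentences from which the video content can be recovered, which is the backward, sentence-to-video flow the reconstruction mechanism adds.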