Video captioning aims to automatically generate accurate and coherent natural-language sentences that describe the content of a given video segment. Its difficulty lies not only in identifying the objects, attributes, and behaviors in a video, but also in understanding the relationships among them and, more importantly, in making the generated sentences conform to the constraints of grammar. Video captioning has broad application prospects in video retrieval, human-computer interaction, and autonomous driving, and has become a research hotspot in computer vision.

At present, the attention-based encoder-decoder structure has become the core and most widely used architecture for video captioning. It employs convolutional neural networks as encoders to extract visual features from the video sequence, feeds these features into a decoder that generates the description word by word, and incorporates an attention mechanism to align visual semantics with textual semantics. However, this structure still has the following shortcomings: (1) the attention mechanism models only the dependence of the word sequence on the source video sequence and ignores the text sequence already generated by the decoder, which reduces the semantic consistency of the description; (2) the attention mechanism does not consider the sparse alignment between visual and textual semantics, so redundant visual information is retained and the accuracy of word prediction is reduced; (3) the decoder does not refine its hidden states during word-by-word decoding, so a single word-prediction error can cascade into errors in subsequent predictions.

To address these issues, this thesis studies video caption generation based on the sequence-to-sequence model. The core idea is to incorporate a context-oriented deliberation network into the encoder-decoder architecture so that erroneous hidden states produced by the decoder are corrected in time, and to further design sparse attention mechanisms that filter out irrelevant visual information.

First, a video captioning method based on a contextual deliberation network is proposed. The network consists of an encoder, a decoder, and a deliberator. The encoder uses a pre-trained convolutional neural network to extract a fixed-dimensional representation of the source video. The decoder uses a recurrent neural network to decode this representation into an initial (draft) sequence of hidden states. The deliberator then uses visual context and textual context to refine the draft hidden states and thereby avoid cascading errors. In addition, a gating mechanism is designed to dynamically balance the visual and textual contexts; a sketch of such a gate is given below. Compared with the attention-based encoder-decoder method, the proposed method improves the CIDEr score on the MSVD and MSR-VTT datasets by 6.3% and 7.2%, respectively.
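The following is a minimal PyTorch sketch of such a context gate, written under stated assumptions rather than as the thesis's exact implementation: the deliberator is assumed to keep one hidden-state vector per decoding step, the visual and textual context vectors are assumed to share one dimensionality, and all names (ContextGate, hidden_dim, ctx_dim) are illustrative.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Sketch of a gate that dynamically balances visual and textual context.

    A per-dimension gate g in [0, 1] is predicted from the draft hidden
    state together with both context vectors; the fused context then
    drives one refinement step of the hidden state. This is one plausible
    reading of the abstract, not the thesis's exact design.
    """

    def __init__(self, hidden_dim: int, ctx_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim + 2 * ctx_dim, ctx_dim),
            nn.Sigmoid(),
        )
        # One recurrent refinement step for the draft hidden state.
        self.refine = nn.GRUCell(ctx_dim, hidden_dim)

    def forward(self, h_draft, visual_ctx, text_ctx):
        # g -> 1 favors the visual context, g -> 0 the textual context.
        g = self.gate(torch.cat([h_draft, visual_ctx, text_ctx], dim=-1))
        fused_ctx = g * visual_ctx + (1.0 - g) * text_ctx
        return self.refine(fused_ctx, h_draft)  # refined hidden state

# Example usage with random tensors (batch of 2, hidden/context size 512):
deliberate = ContextGate(hidden_dim=512, ctx_dim=512)
h_refined = deliberate(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```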
Second, to further exploit the sparse alignment between visual and textual semantic information, this thesis proposes a video captioning method that integrates sparse attention mechanisms. The core is to construct two sparse attention mechanisms on top of the proposed contextual deliberation network: an explicit sparse attention mechanism and an implicit sparse attention mechanism. Specifically, the explicit sparse attention mechanism treats elements of the correlation score matrix that fall below a set threshold as irrelevant information and explicitly sets the corresponding attention weights to zero, while the implicit sparse attention mechanism uses a designed gate to implicitly filter irrelevant information out of the conventional attention result; a sketch of both variants is given at the end of this section. Compared with the contextual deliberation network equipped with conventional attention, this method improves the CIDEr score on the MSVD and MSR-VTT datasets by 0.6% and 0.4%, respectively.

The main contributions of this thesis are summarized as follows: (1) To address the cascading errors caused by the decoder's failure to correct inaccurate hidden states in time, a video captioning method based on a contextual deliberation network is proposed; it uses visual and textual context to further refine the draft hidden states generated by the decoder. (2) To address the reduced semantic consistency caused by the attention mechanism's failure to model the dependence of target words on the already generated text sequence, this thesis uses an attention mechanism to extract relevant information from the deliberator's refined hidden-state sequence as textual context and incorporates it into the deliberation process. (3) To address the attention mechanism's neglect of the sparse alignment between visual and textual semantics, this thesis designs two types of sparse attention, "explicit sparse attention" and "implicit sparse attention", to filter out visual information that is irrelevant to word prediction.
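The following minimal PyTorch sketch illustrates the two variants for a single query attending over a sequence of visual features. The dot-product scoring function, the threshold value, and the construction of the implicit gate from the query and the attended context are assumptions made for exposition, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

def explicit_sparse_attention(query, keys, values, threshold=0.1):
    """Explicit variant: correlation scores below `threshold` are treated
    as irrelevant and their attention weights are set to zero outright.
    query: (batch, d); keys, values: (batch, n, d).
    """
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (batch, n)
    weights = F.softmax(scores, dim=-1)
    weights = weights * (scores >= threshold).float()           # hard zeroing
    # Renormalize the surviving weights into a valid distribution.
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1)   # (batch, d)

def implicit_sparse_attention(query, keys, values, gate_proj):
    """Implicit variant: attend normally, then suppress irrelevant features
    of the attended context through a learned sigmoid gate.
    gate_proj: e.g. torch.nn.Linear(2 * d, d), a hypothetical gate layer.
    """
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)                         # stays dense
    ctx = torch.bmm(weights.unsqueeze(1), values).squeeze(1)
    gate = torch.sigmoid(gate_proj(torch.cat([query, ctx], dim=-1)))
    return gate * ctx

# Example usage (batch of 2, n = 5 video features, feature size d = 8):
q, K, V = torch.randn(2, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 8)
ctx_e = explicit_sparse_attention(q, K, V)
ctx_i = implicit_sparse_attention(q, K, V, gate_proj=torch.nn.Linear(16, 8))
```

The two designs trade off differently: the explicit variant produces exactly zero weights, so irrelevant features cannot leak into the context at all, while the implicit variant keeps the attention distribution dense and instead suppresses irrelevant dimensions of the attended vector.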