| The rapid growth of self-media and short-video platforms has caused the volume of video data to rise explosively. How to extract key information from massive video data intelligently and quickly has become a significant research focus in artificial intelligence. Dense video captioning is a task that combines computer vision and natural language processing. It aims to automatically localize segments of a video according to the events they contain and to generate a natural language caption for each segment. It has important application value in human-computer interaction, video retrieval, intelligent editing and other areas. Visual features are the most commonly used video representation, but audio features are equally important. Models that use an RNN, an LSTM or their variants as the main architecture for dense video captioning struggle with long-term dependencies and cannot be trained in parallel, and the video segments they generate have a low recall rate while the captions are semantically inaccurate. To address these problems, a dense video captioning algorithm based on a multi-modal Transformer and Anchors is proposed. The main work and innovations of this paper are summarized as follows: (1) To address the insufficient use of video features in video captioning algorithms and the difficulty of handling long-term dependencies when an LSTM or its variants serve as the main architecture, a single-modality dense video captioning model based on the Transformer is constructed. First, the visual, audio and speech features of the video are extracted, and positional encoding is applied to each feature. The self-attention mechanism is then used to compute the correlation between elements of the single-modality video frame feature sequence, as well as the correlation between the caption word feature sequence and the single-modality video features. Words are generated one at a time, each being the word with the highest probability, in the captioning text
generator. In tests on the ActivityNet Captions dataset, the METEOR score improves by 1.47% compared with a model that uses an LSTM as the main architecture. The results also show that the modalities contribute to dense video captioning in the order visual features > audio features > speech features. (2) To address the inaccurate localization and low recall rate of existing video segment localization algorithms, an Anchor-based video segment localization method is proposed. The event location detection module consists of a three-layer convolutional neural network. The K-Means algorithm is used to cluster the lengths of the annotated video segments, and the resulting cluster centers are used as the anchor lengths. Inspired by the YOLOv3 object detection algorithm, the center, length and confidence score of each predicted box are obtained by fine-tuning the Anchors at a single scale, and the predicted video segments are obtained by sorting on the confidence scores. During training, the gap between the predicted box and the ground-truth box is used to construct the loss function, and the weights are updated by backpropagation, so that training is end-to-end. Compared with the PDVC model, this method improves the F1-score by 2.43%, showing strong video segment localization ability. (3) To address the lack of effective interaction between multi-modal video features in existing methods, a dense video captioning model based on a multi-modal Transformer and Anchors is proposed. Building on the standard Transformer, multi-modal input channels are added, and a multi-modal attention mechanism is introduced to compute the correlation between video frame features of different modalities as well as the correlation between caption sequence features and multi-modal video frame features, which strengthens the interaction between features. The dual-modality event location detection module was used to generate prediction segments through Anchor decoding when
different modalities were input, and these segments were stored in a shared prediction database. The prediction segments were sorted and screened, and the full input video feature sequence was cropped accordingly and fed back into the model to generate the caption text for each segment, thus completing the dense video captioning task. Compared with the PDVC model, this model improves the F1-score by 3.39%; compared with the BMT model, it improves the METEOR score by 0.34%. Experiments show that the dense video captioning model based on a multi-modal Transformer and Anchors has strong video segmentation and semantic captioning capabilities. |
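
The anchor-length clustering and YOLOv3-inspired decoding described in contribution (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `kmeans_1d` and `decode_anchor`, the quantile initialisation, the `stride` parameter, and the sigmoid/exponential decoding formulas (borrowed from YOLOv3 box decoding and reduced to a single temporal axis) are all hypothetical, and the sample segment lengths are made up.

```python
import math


def kmeans_1d(lengths, k, iters=100):
    """Plain Lloyd's k-means on scalar segment lengths; returns sorted centers."""
    data = sorted(lengths)
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of segments")
    # Deterministic initialisation (an assumption): evenly spaced quantiles.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Keep the previous center if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def decode_anchor(cell_index, anchor_len, t_center, t_len, t_conf, stride=1.0):
    """Turn raw network outputs into (start, end, confidence).

    Assumed YOLOv3-style rule on one temporal axis: the center is a
    sigmoid offset inside the cell, the length rescales the anchor.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    center = (cell_index + sigmoid(t_center)) * stride
    length = anchor_len * math.exp(t_len)
    start = max(0.0, center - length / 2)
    end = center + length / 2
    return start, end, sigmoid(t_conf)


# Made-up ground-truth segment lengths (in seconds) for illustration only.
anchor_lengths = kmeans_1d([4.0, 5.0, 6.0, 28.0, 30.0, 32.0, 88.0, 90.0, 92.0], k=3)
segment = decode_anchor(cell_index=10, anchor_len=anchor_lengths[1],
                        t_center=0.0, t_len=0.0, t_conf=0.0, stride=2.0)
```

In a full pipeline, the decoded segments would then be sorted by confidence to select the final proposals, as the abstract describes; that ranking step and the training loss are omitted here.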