
Video Description Technology Based On Deep Learning

Posted on: 2021-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: G Hai
Full Text: PDF
GTID: 2428330605956100
Subject: Instrument Science and Technology
Abstract/Summary:
With the development of artificial intelligence, deep learning is increasingly applied to video description. The mainstream framework is the encoder-decoder: convolutional neural networks extract visual features from the video, and recurrent neural networks then use those features to generate the description. Most models built on this framework, however, do not mine the temporal and spatial features of the video deeply. In the language-generation stage, the methods used to fuse video features with text features are very simple and ignore deep interaction between the features. When category information from the dataset is used, there is no deep mapping training, only ordinary weighted fusion, and there is no dedicated method for extracting and exploiting semantic features; as a result, the generated descriptions can lack coherence and logic. When training the decoder, the conventional approach uses manually labeled words to guide the caption model step by step (teacher forcing), and at each step the output word is chosen by taking the maximum of the predicted distribution. These limitations hinder the generation of video captions with high readability and accuracy.

To address these problems, this thesis first designs a video description framework, CRFAC-S2VT, which builds on S2VT and combines a two-level attention mechanism with a compact linear pooling layer. In the video pre-processing stage, visual features and category labels from the dataset are used as the inputs and targets for category training of the CNNs, and the trained CNNs are then used to extract the visual features of the video. In the encoding stage, a convolutional region attention mechanism is designed: it focuses on the relevant regions of the extracted 2D visual feature map without destroying its spatial structure, outputs the attended 2D visual feature, and fuses it with the C3D visual feature that carries temporal information. The encoder then models the mixed visual feature, which contains both temporal and spatial information.

In the decoding stage, an attention mechanism that focuses on the key frames of the video is designed. The text features from the dataset and the video features produced by this frame-level attention are fed into the compact linear pooling layer, which fuses the two kinds of features at a fine-grained level. The mixed feature is used as the input of the decoder, which generates accurate captions that are highly correlated with the video.

Secondly, building on CRFAC-S2VT, this thesis designs a combined framework, SFAC-S2VT, which adds a multi-label semantic detector for extracting semantic features from the video, an M-LSTM that accepts multi-modal feature inputs, and a caption structure loss function. The decoder of S2VT is trained with an autonomous random training method: at each decoding step, the word predicted by the decoder is used as its input at the next step, and the output word is selected by randomly sampling the predicted distribution rather than taking its maximum. The caption structure loss function in this framework can adjust the length of the generated captions.
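The sketches below illustrate, in PyTorch-style code, the building blocks named in this abstract; all class names, layer sizes, and design details are assumptions made for illustration and are not the thesis's actual implementation. First, the basic S2VT-style encoder-decoder: an LSTM encoder reads CNN frame features and an LSTM decoder emits the caption word by word.

    # Minimal S2VT-style encoder-decoder sketch (PyTorch assumed; all sizes illustrative).
    import torch
    import torch.nn as nn

    class S2VTSketch(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, embed=300):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # reads frame features
            self.decoder = nn.LSTM(embed + hidden, hidden, batch_first=True)
            self.embed = nn.Embedding(vocab_size, embed)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, frame_feats, captions):
            # frame_feats: (B, T, feat_dim) CNN features; captions: (B, L) word ids
            enc_out, (h, c) = self.encoder(frame_feats)
            ctx = enc_out[:, -1:, :]                    # last encoder state as a simple context
            words = self.embed(captions)                # (B, L, embed)
            ctx = ctx.expand(-1, words.size(1), -1)     # repeat context for every decoding step
            dec_out, _ = self.decoder(torch.cat([words, ctx], dim=-1), (h, c))
            return self.out(dec_out)                    # (B, L, vocab_size) word logits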
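The convolutional region attention of the encoding stage can be pictured as a small convolution that scores every spatial position of the 2D feature map and re-weights the map in place, so the spatial structure is preserved; the attended feature is then fused with a C3D feature carrying temporal information. The exact layers below are assumed, not taken from the thesis.

    # Rough sketch of a convolutional region attention (assumed form, sizes illustrative).
    import torch
    import torch.nn as nn

    class ConvRegionAttention(nn.Module):
        def __init__(self, channels=2048):
            super().__init__()
            # A 1x1 convolution produces one attention score per spatial location.
            self.score = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, fmap):                   # fmap: (B, C, H, W) 2D visual feature map
            attn = torch.softmax(self.score(fmap).flatten(2), dim=-1)   # (B, 1, H*W)
            attn = attn.view(fmap.size(0), 1, *fmap.shape[2:])          # back to (B, 1, H, W)
            focused = fmap * attn                  # re-weighted map, spatial structure kept
            pooled = focused.flatten(2).sum(-1)    # (B, C) attended visual feature
            return focused, pooled

    # Fusing the attended 2D feature with a C3D clip feature (concatenation as a placeholder).
    def fuse_2d_3d(pooled_2d, c3d_feat):           # (B, C_2d), (B, C_3d)
        return torch.cat([pooled_2d, c3d_feat], dim=-1)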
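The frame-level attention of the decoding stage scores each encoded frame against the current decoder state and forms a weighted video context that emphasizes key frames; an additive-attention form is assumed here.

    # Frame-level attention sketch (assumed additive-attention form).
    import torch
    import torch.nn as nn

    class FrameAttention(nn.Module):
        def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
            super().__init__()
            self.w_enc = nn.Linear(enc_dim, attn_dim)
            self.w_dec = nn.Linear(dec_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, enc_frames, dec_state):   # (B, T, enc_dim), (B, dec_dim)
            scores = self.v(torch.tanh(self.w_enc(enc_frames) + self.w_dec(dec_state).unsqueeze(1)))
            weights = torch.softmax(scores, dim=1)  # (B, T, 1), one weight per frame
            return (weights * enc_frames).sum(dim=1)  # (B, enc_dim) key-frame context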
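The compact linear pooling layer used for fine-grained fusion of text and video features is sketched here as a compact-bilinear-pooling-style operator realized with a count sketch and FFT, which is one common way to approximate a full outer-product interaction between two feature vectors; whether the thesis uses this exact realization is an assumption.

    # Compact-bilinear-pooling-style fusion of a text feature and a video feature
    # (count-sketch + FFT realization; an assumption about the "compact linear pooling" layer).
    import torch
    import torch.nn as nn

    class CompactBilinearFusion(nn.Module):
        def __init__(self, dim_a, dim_b, out_dim=8000):
            super().__init__()
            self.out_dim = out_dim
            for name, dim in (("a", dim_a), ("b", dim_b)):
                # Random but fixed hash indices and signs for the count sketch.
                self.register_buffer(f"h_{name}", torch.randint(out_dim, (dim,)))
                self.register_buffer(f"s_{name}", torch.randint(0, 2, (dim,)).float() * 2 - 1)

        def _sketch(self, x, h, s):
            sk = x.new_zeros(x.size(0), self.out_dim)
            return sk.index_add_(1, h, x * s)       # scatter signed values into sketch bins

        def forward(self, a, b):                    # a: (B, dim_a) text, b: (B, dim_b) video
            fa = torch.fft.rfft(self._sketch(a, self.h_a, self.s_a), dim=1)
            fb = torch.fft.rfft(self._sketch(b, self.h_b, self.s_b), dim=1)
            # Element-wise product in the frequency domain = circular convolution of the sketches.
            return torch.fft.irfft(fa * fb, n=self.out_dim, dim=1)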
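The multi-label semantic detector of SFAC-S2VT can be viewed as a classifier that outputs, for each semantic concept, the probability that it appears in the video; the network shape below is illustrative.

    # Multi-label semantic detector sketch (concept count and layer sizes illustrative).
    import torch
    import torch.nn as nn

    class SemanticDetector(nn.Module):
        def __init__(self, feat_dim=2048, num_concepts=300):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(),
                nn.Linear(512, num_concepts))

        def forward(self, video_feat):                  # (B, feat_dim) pooled video feature
            return torch.sigmoid(self.mlp(video_feat))  # (B, num_concepts) concept probabilities

    # Training would use a multi-label loss such as nn.BCELoss against 0/1 concept labels.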
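Autonomous random training replaces teacher forcing: the decoder's own prediction, sampled at random from its output distribution rather than taken as the argmax, is fed back as the input at the next step. A minimal decoding loop, assuming a hypothetical decoder_step(prev_words, state) interface, might look like this.

    # Sketch of the autonomous-random idea: sample the next word instead of argmax,
    # and feed the model's own output back in (decoder_step is a hypothetical API).
    import torch

    def autonomous_random_decode(decoder_step, start_ids, max_len=20):
        words, state = start_ids, None              # start_ids: (B,) start-of-sentence ids
        generated = []
        for _ in range(max_len):
            logits, state = decoder_step(words, state)                  # logits: (B, vocab)
            probs = torch.softmax(logits, dim=-1)
            words = torch.multinomial(probs, num_samples=1).squeeze(-1)  # random draw, not argmax
            generated.append(words)
        return torch.stack(generated, dim=1)        # (B, max_len) sampled caption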
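The abstract states only that the caption structure loss can adjust the length of the generated caption; its actual form is not given there. Purely as a hypothetical illustration, a length-regularizing term could look like the following.

    # Hypothetical length-regularizing term (NOT the thesis's loss; its form is not
    # specified in the abstract): penalize the gap between generated and reference lengths.
    import torch

    def length_penalty(gen_lengths, ref_lengths, weight=0.1):
        # gen_lengths, ref_lengths: (B,) token counts excluding padding
        return weight * torch.abs(gen_lengths.float() - ref_lengths.float()).mean()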
We evaluate the proposed frameworks on the MSR-VTT and MSVD datasets. The results show that the designed frameworks are competitive with the current state of the art on both datasets.
Keywords/Search Tags: Video description, Convolutional region attention mechanism, Fine-grained feature fusion, Autonomous random training, Loss function for the structure of captions