
Video Captioning Based On Deep Learning

Posted on: 2020-08-07    Degree: Master    Type: Thesis
Country: China    Candidate: P Y Ning    Full Text: PDF
GTID: 2428330590484507    Subject: Communication and Information System
Abstract/Summary:
With the explosive growth of video services, manual use and management of video can no longer keep pace with business needs, and it is urgent to introduce computer-based automated video analysis methods. Video captioning converts video content into natural language descriptions that are easy to process, making it an important technology for handling video information. However, existing deep-learning-based video captioning methods still fall short of the requirements of practical production and daily life. This thesis studies the key issues involved; the main work is summarized as follows.

To improve the accuracy with which the captioning model describes video objects, a video frame feature extractor based on high-level semantics is proposed. The extractor consists of four processing stages: object detection, matching of objects to features, feature enhancement, and feature format conversion. For each stage, the influence of special video conditions on feature extraction is analyzed and corresponding processing is proposed to improve feature reliability. In addition, because the high-level semantic information is interpretable, the extractor can adjust its parameters or replace components according to its performance on specific video data, which gives it better generality. Experimental results show that the extracted frame features effectively improve the performance of the captioning model on the Microsoft Video Description (MSVD) dataset, demonstrating that high-level semantic information can improve description accuracy.

To improve the model's ability to describe complex video objects and scenes, an encoder improvement based on feature fusion is proposed. On the one hand, a densely connected network (DenseNet) is used to improve visual feature extraction at multiple semantic levels of the video, thereby increasing the diversity and descriptive power of the features. On the other hand, the typical feature fusion paradigms in deep learning models are summarized, and four feature fusion frameworks oriented to video captioning are adopted to improve the encoder network structure. Experimental results show that the fusion-based encoder produces features that are both accurate and diverse, effectively improves the performance of the captioning model on the MSR-VTT dataset, and demonstrates that feature fusion can improve the descriptive capability of the model.
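To make the fusion idea concrete, the following is a minimal sketch, assuming PyTorch and torchvision: a DenseNet-121 backbone supplies frame appearance features, per-frame high-level semantic vectors (for example, from the detection-based extractor described above) are concatenated with them, and a linear projection followed by a GRU encodes the fused sequence. The class name, dimensions, and the simple concatenation scheme are illustrative assumptions rather than the four fusion frameworks evaluated in the thesis.

# Minimal sketch of concatenation-based feature fusion for a video
# captioning encoder (illustrative names and dimensions, not thesis code).
import torch
import torch.nn as nn
from torchvision import models

class FusionEncoder(nn.Module):
    """Fuses DenseNet appearance features with per-frame high-level
    semantic features (e.g. detection-based vectors) before a GRU."""

    def __init__(self, semantic_dim=300, hidden_dim=512):
        super().__init__()
        densenet = models.densenet121()        # appearance backbone
        self.backbone = densenet.features      # drops the classifier head
        self.pool = nn.AdaptiveAvgPool2d(1)    # global pooling to a 1024-d vector
        # Project the concatenated (appearance + semantic) vector to a
        # common hidden size, then encode the frame sequence over time.
        self.fuse = nn.Linear(1024 + semantic_dim, hidden_dim)
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, frames, semantics):
        # frames:    (batch, time, 3, H, W) sampled video frames
        # semantics: (batch, time, semantic_dim) per-frame semantic vectors
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))              # (b*t, 1024, h, w)
        x = self.pool(torch.relu(x)).flatten(1).view(b, t, -1)
        fused = torch.relu(self.fuse(torch.cat([x, semantics], dim=-1)))
        return self.temporal(fused)            # (outputs, final hidden state)

Concatenation followed by a learned projection is only one of the fusion paradigms summarized in the thesis; the same skeleton could host addition-based or attention-based fusion by swapping the fuse step.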
To alleviate the slow operation of the recurrent neural network (RNN), which hinders both research on and application of video captioning models, an improvement based on new RNN variants is proposed. On the one hand, the reduced parameters and state of the new RNNs are used to cut the computational redundancy of the captioning encoder; on the other hand, their structure makes the model easier to train and optimize, so performance is maintained. Two new RNN variants, the SRU and the IndRNN, were selected for the experiments. The results show that, compared with a captioning model using a traditional RNN encoder, the SRU-based encoder improves computational efficiency by no less than 6.4 percent while maintaining performance, and the IndRNN-based encoder improves computational efficiency by no less than 30.9 percent with a performance loss of no more than 11 percent. These results show that the new RNN variants are effective in improving the computational efficiency of video captioning models.
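As a rough illustration of why replacing a traditional RNN with an SRU-style cell speeds up the encoder, the sketch below is a simplified Simple Recurrent Unit (its highway connection is omitted; the class name and dimensions are assumptions, not the thesis code). All matrix multiplications depend only on the input and are computed for the whole sequence in one batched projection, leaving only cheap element-wise operations inside the sequential loop.

# Simplified SRU-style recurrent layer (illustrative, not the thesis code).
import torch
import torch.nn as nn

class SimpleSRU(nn.Module):
    """Matrix multiplications depend only on the input, so they are
    batched over time; the recurrence itself is element-wise and cheap."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One projection yields the candidate, forget gate and reset gate
        # for every time step in a single matmul.
        self.proj = nn.Linear(input_dim, 3 * hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x):
        # x: (batch, time, input_dim)
        b, t, _ = x.shape
        z, f, r = self.proj(x).chunk(3, dim=-1)     # each (b, t, hidden)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = x.new_zeros(b, self.hidden_dim)         # initial cell state
        outputs = []
        for step in range(t):
            # No per-step matrix multiplication inside the loop.
            c = f[:, step] * c + (1 - f[:, step]) * z[:, step]
            outputs.append(r[:, step] * torch.tanh(c))
        return torch.stack(outputs, dim=1), c       # (b, t, hidden), (b, hidden)

An IndRNN-style layer goes further by replacing the full recurrent weight matrix with a learned element-wise weight vector, which is consistent with the larger efficiency gains reported above.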
Keywords/Search Tags:Video Analysis, Video Captioning, Feature Extraction, Feature Fusion, Computational Efficiency