Video Caption Based On Enhanced Visual-semantics

Posted on: 2020-03-12    Degree: Master    Type: Thesis
Country: China    Candidate: J D Ye    Full Text: PDF
GTID: 2428330623459099    Subject: Engineering
Abstract/Summary:
With the continuous development of artificial intelligence, video captioning, which combines Computer Vision (CV) and Natural Language Processing (NLP), has attracted wide attention in recent years. Given a video, a video captioning system automatically generates a natural-language description of the video's content. It has a wide range of real-world applications, such as improving the efficiency of searching for videos on the Internet and assisting people with disabilities in understanding video content. At present, owing to the breakthrough progress of deep learning in fields such as CV and NLP, more and more work applies deep learning techniques to these problems. This thesis likewise studies video captioning within the deep-learning-based "encoder-decoder" framework. In the traditional captioning framework, a convolutional neural network is usually used as the encoder and a recurrent neural network as the decoder to generate sentences for videos. Because of the large gap between visual information and semantic information, a captioning model cannot learn the relationship between the two well if it relies on the decoder alone.

To alleviate this issue, this thesis aims to narrow the gap between the visual and the semantic in three ways. First, we add a visual-semantic correlation module to the encoder-decoder framework, so that the model can better learn the relationship between visual and semantic information and thus enhance their consistency. Second, we propose a stronger video feature extractor for video captioning: using a video retrieval model, we obtain visual features that carry more semantic information and have stronger representational power, which improves the model's understanding of the semantics contained in the visual input. Third, we propose mining the semantic attributes of a video to improve the captioning model: a retrieval model is used to obtain text related to the video, the labels of these texts are extracted as the video's semantic attributes, and the obtained attributes are then used to help the model generate better sentences.

Moreover, we conducted experiments on the MSR-VTT and MSVD datasets to verify the effectiveness of our methods.
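The visual-semantic correlation idea above can be sketched as follows: project the video and the sentence into a shared space and penalize their disagreement with a consistency loss, trained jointly with the usual captioning loss. This is a minimal NumPy illustration, not the thesis's actual architecture; all dimensions, projection matrices, and the mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch of a visual-semantic consistency term: mean-pool frame
# features and word embeddings, project both into a joint space, and
# penalize 1 - cosine similarity. All names/dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frame_feats, W_v):
    """Mean-pool per-frame CNN features, then project to the joint space."""
    pooled = frame_feats.mean(axis=0)          # (d_vis,)
    return pooled @ W_v                        # (d_joint,)

def encode_sentence(word_embs, W_s):
    """Mean-pool word embeddings, then project to the joint space."""
    pooled = word_embs.mean(axis=0)            # (d_txt,)
    return pooled @ W_s                        # (d_joint,)

def consistency_loss(v, s):
    """1 - cosine similarity: small when visual and semantic vectors agree."""
    cos = np.dot(v, s) / (np.linalg.norm(v) * np.linalg.norm(s) + 1e-8)
    return 1.0 - cos

# Toy inputs: 16 frames of 2048-d CNN features, 8 words of 300-d embeddings.
frames = rng.standard_normal((16, 2048))
words = rng.standard_normal((8, 300))
W_v = rng.standard_normal((2048, 512)) * 0.01  # learned visual projection
W_s = rng.standard_normal((300, 512)) * 0.01   # learned textual projection

loss = consistency_loss(encode_video(frames, W_v), encode_sentence(words, W_s))
print(loss)  # lies in [0, 2]; minimized jointly with the caption loss
```

In training, the two projections would be learned so that a video and its ground-truth caption land close together in the joint space, which is the consistency the module enforces.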
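The semantic-attribute mining step can likewise be sketched in a few lines: from the captions a retrieval model returns for a video, keep the most frequent content words as the video's attributes. The retrieved captions, stop-word list, and top-k value below are illustrative assumptions, not the thesis's actual retrieval output or labeling scheme.

```python
# Minimal sketch of semantic-attribute mining: given captions retrieved
# for a video, keep the most frequent content words as its attributes.
# The captions, stop-word set, and top-k value are illustrative.
from collections import Counter

retrieved = [
    "a man is playing a guitar",
    "a man plays the guitar on stage",
    "someone playing guitar music",
]
stop_words = {"a", "an", "is", "the", "on"}

tokens = [w for cap in retrieved for w in cap.split() if w not in stop_words]
attributes = [w for w, _ in Counter(tokens).most_common(3)]
print(attributes)
```

The resulting attribute words would then be embedded and fed to the decoder as extra conditioning, so that sentence generation is guided by high-level semantics rather than raw visual features alone.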
Keywords/Search Tags: Deep learning, Consistency of visual and semantic, Video caption, Semantic mining