Video Caption Based On Enhanced Visual-semantics

Posted on: 2020-03-12    Degree: Master    Type: Thesis
Country: China    Candidate: J D Ye    Full Text: PDF
GTID: 2428330623459099    Subject: Engineering
Abstract/Summary:
With the continuous development of artificial intelligence, video captioning, which combines Computer Vision (CV) and Natural Language Processing (NLP), has attracted wide attention in recent years. Given a video, a video captioning system automatically generates a natural-language description of the video's content. It has a wide range of real-world applications, such as improving the efficiency of searching for videos on the Internet and assisting people with disabilities in understanding video content. At present, owing to the breakthrough progress of deep learning in fields such as CV and NLP, more and more work applies deep learning techniques to these problems. This thesis likewise studies video captioning within the deep-learning-based "encoder-decoder" framework. In the traditional captioning framework, a convolutional neural network is usually used as the encoder and a recurrent neural network as the decoder to generate sentences for videos. Because of the large gap between visual information and semantic information, a captioning model cannot learn the relationship between the two well if it relies on the decoder alone.

To alleviate this issue, this thesis aims to narrow the gap between the visual and the semantic in three ways. First, we add a visual-semantic correlation module to the encoder-decoder framework, so that the model can better learn the relationship between visual and semantic information and thus enhance their consistency. Second, we propose a stronger video feature extractor for video captioning: using a video retrieval model, we obtain visual features that carry more semantic information and have stronger representational power, which improves the model's understanding of the semantics contained in the visual input. Third, we propose mining the semantic attributes of a video to improve the captioning model: a retrieval model is used to obtain text related to the video, the labels of these texts are extracted as the video's semantic attributes, and the obtained attributes are then used to help the model generate better sentences.

Moreover, we conducted experiments on the MSR-VTT and MSVD datasets to verify the effectiveness of our methods.
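The visual-semantic correlation idea above can be sketched as follows: project the video and the sentence into a shared space and penalize their disagreement with a consistency loss, trained jointly with the usual captioning loss. This is a minimal NumPy illustration, not the thesis's actual architecture; all dimensions, projection matrices, and the mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch of a visual-semantic consistency term: mean-pool frame
# features and word embeddings, project both into a joint space, and
# penalize 1 - cosine similarity. All names/dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frame_feats, W_v):
    """Mean-pool per-frame CNN features, then project to the joint space."""
    pooled = frame_feats.mean(axis=0)          # (d_vis,)
    return pooled @ W_v                        # (d_joint,)

def encode_sentence(word_embs, W_s):
    """Mean-pool word embeddings, then project to the joint space."""
    pooled = word_embs.mean(axis=0)            # (d_txt,)
    return pooled @ W_s                        # (d_joint,)

def consistency_loss(v, s):
    """1 - cosine similarity: small when visual and semantic vectors agree."""
    cos = np.dot(v, s) / (np.linalg.norm(v) * np.linalg.norm(s) + 1e-8)
    return 1.0 - cos

# Toy inputs: 16 frames of 2048-d CNN features, 8 words of 300-d embeddings.
frames = rng.standard_normal((16, 2048))
words = rng.standard_normal((8, 300))
W_v = rng.standard_normal((2048, 512)) * 0.01  # learned visual projection
W_s = rng.standard_normal((300, 512)) * 0.01   # learned textual projection

loss = consistency_loss(encode_video(frames, W_v), encode_sentence(words, W_s))
print(loss)  # lies in [0, 2]; minimized jointly with the caption loss
```

In training, the two projections would be learned so that a video and its ground-truth caption land close together in the joint space, which is the consistency the module enforces.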
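The semantic-attribute mining step can likewise be sketched in a few lines: from the captions a retrieval model returns for a video, keep the most frequent content words as the video's attributes. The retrieved captions, stop-word list, and top-k value below are illustrative assumptions, not the thesis's actual retrieval output or labeling scheme.

```python
# Minimal sketch of semantic-attribute mining: given captions retrieved
# for a video, keep the most frequent content words as its attributes.
# The captions, stop-word set, and top-k value are illustrative.
from collections import Counter

retrieved = [
    "a man is playing a guitar",
    "a man plays the guitar on stage",
    "someone playing guitar music",
]
stop_words = {"a", "an", "is", "the", "on"}

tokens = [w for cap in retrieved for w in cap.split() if w not in stop_words]
attributes = [w for w, _ in Counter(tokens).most_common(3)]
print(attributes)
```

The resulting attribute words would then be embedded and fed to the decoder as extra conditioning, so that sentence generation is guided by high-level semantics rather than raw visual features alone.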
Keywords/Search Tags: Deep learning, Consistency of visual and semantic, Video caption, Semantic mining