Natural Language Description Generation Method for Short Videos

Posted on: 2020-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: J B Ai
Full Text: PDF
GTID: 2428330596476514
Subject: Engineering
Abstract/Summary:
With the rapid development of Internet technology, multimedia data has accumulated on a massive scale, and the largest and most complex part of it is video. In the 21st century, living standards have improved greatly while the pace of life keeps accelerating, so people's time has become especially precious. Having to watch entire short videos just to extract the information one needs has become cumbersome. If video content could be converted into relevant text, people could obtain that information directly and quickly by reading a short passage. On this premise, how to convert short-video content into relevant text effectively and quickly has become an urgent problem. The central research question in video-to-text, often called "talking about videos", is how to convert video data that is rich in information into text that is useful to people.

In essence, the integration of vision and language is an ability that human beings developed through long evolution and continuous learning. For machines, it means connecting visual processing with language processing. Although deep learning gives machines a strong ability to recognize and understand image and video data, it still suffers from insufficient robustness in real application scenarios. It is therefore particularly important to extract the rich semantic and scene information of video data effectively, and to judge whether the generated text is reasonable and natural. Natural language generation based on deep learning is an emerging research direction with many areas worth exploring, and it will face more and more challenges in the future.

In recent years, the generative adversarial network (GAN) framework has been widely and successfully applied, mainly to dialogue systems, image synthesis and enhancement, and text generation. Most of these applications are text-to-text or image-to-image; few apply the framework to video. On the other hand, a survey of the literature shows that most current deep learning methods for generating natural language descriptions of short videos use long short-term memory (LSTM) networks or recurrent networks. Experiments show, however, that LSTM and recurrent networks lose information when handling long time series. For short videos this is a serious problem, because long input sequences arise in both the encoding phase and the decoding phase.

To address these problems, we first propose a method that applies generative adversarial networks to short-video captioning. On top of a traditional short-video captioning model, we design a new discriminator network. By adding this discriminator to our framework, we can effectively judge whether the generated sentences are coherent and understandable, and whether they are consistent with the video content. Experiments on three public datasets show that our model effectively improves the accuracy of short-video captioning.
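To make the discriminator idea concrete, here is a minimal PyTorch sketch of a network that scores a (video, sentence) pair for fluency and relevance to the video content. All module names, dimensions, and the mean-pooling choice are illustrative assumptions, not the thesis's actual architecture.

    # Hypothetical caption discriminator: scores a (video, sentence) pair.
    import torch
    import torch.nn as nn

    class CaptionDiscriminator(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, vid_dim=2048, hidden=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Encodes the candidate sentence into a single vector.
            self.sent_rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
            # Projects pooled video features into the same space.
            self.vid_proj = nn.Linear(vid_dim, hidden)
            # Judges whether the sentence is natural and matches the video.
            self.score = nn.Sequential(
                nn.Linear(hidden * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, video_feats, captions):
            # video_feats: (B, T, vid_dim) frame features; captions: (B, L) token ids
            v = self.vid_proj(video_feats.mean(dim=1))        # temporal mean pooling
            _, (h, _) = self.sent_rnn(self.embed(captions))   # final hidden state
            return self.score(torch.cat([v, h[-1]], dim=-1))  # p(real and relevant)

In the usual adversarial setup, such a discriminator would be trained to assign high scores to (video, ground-truth caption) pairs and low scores to (video, generated caption) pairs, while the caption generator is trained to fool it.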
In addition, we propose a cross LSTM and a conditional LSTM. The cross LSTM greatly reduces the information loss of short videos during encoding through its special encoding process. For decoding, we design a conditional LSTM that filters out impure image information at each step and feeds only useful visual information into the decoder. In this way we avoid generated sentences that merely fit the distribution of the ground-truth sentences while missing the video content, which happens when visual information is input only once at the start of decoding. Experiments on three public datasets show that our approach surpasses most natural language generation methods for short videos.
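The following is a hedged sketch of the "filter visual information at each decoding step" idea: a sigmoid gate, conditioned on the decoder state, decides how much of the video feature to feed into the next LSTM step. The names and the exact wiring are assumptions for illustration, not the thesis's actual conditional LSTM.

    # Hypothetical gated decoding step: visual features are re-filtered
    # at every step instead of being injected once at the start.
    import torch
    import torch.nn as nn

    class GatedVisualDecoderStep(nn.Module):
        def __init__(self, embed_dim=256, vid_dim=512, hidden=512):
            super().__init__()
            self.cell = nn.LSTMCell(embed_dim + vid_dim, hidden)
            # Gate that filters the visual feature based on the previous state.
            self.gate = nn.Sequential(
                nn.Linear(hidden + vid_dim, vid_dim), nn.Sigmoid())

        def forward(self, word_emb, video_feat, state):
            h, c = state
            g = self.gate(torch.cat([h, video_feat], dim=-1))  # per-dim gate in (0,1)
            filtered = g * video_feat                          # suppress irrelevant info
            h, c = self.cell(torch.cat([word_emb, filtered], dim=-1), (h, c))
            return (h, c)

Because the gate is recomputed from the current decoder state, each generated word can draw on a different slice of the visual information, which is one plausible way to keep generated sentences tied to the video content throughout decoding.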
Keywords/Search Tags: Computer Vision, Attention Mechanisms, Generative Adversarial Networks, Long Short-Term Memory Networks, Convolutional Neural Networks