
Research And Implementation Of Unsupervised Image Captioning Method With Unseen Object Detection

Posted on: 2021-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: L L Ji
GTID: 2428330614971209
Subject: Computer technology
Abstract/Summary:
With the rise of deep learning, natural language processing and computer vision have become increasingly integrated. Image captioning is a new interdisciplinary problem at the intersection of computer vision, natural language processing and artificial intelligence. Its goal is to describe the content of an image in text. Image captioning can help people with visual impairments, and it can also be used to classify and summarize large collections of images and videos. A captioning model must recognize the important objects in an image, their attributes and their relationships, and it must also generate sentences that are semantically and syntactically correct.

Image captioning algorithms based on supervised learning require costly manual annotation, so this thesis proposes a new unsupervised image captioning algorithm. Existing unsupervised image captioning algorithms lack an attention mechanism: the image is encoded directly into a fixed-length feature vector, and only these global features are used to generate each word during decoding, whereas the human eye attends to particular local regions. This is the first problem. In addition, most existing image captioning methods are based on supervised learning and, because of the limited scale of the training data, the generated descriptions cover no more than 100 object categories. How to identify unseen object classes, that is, classes that appear in the test set but not in the training set, is therefore the second problem with the existing model framework. To address these two problems, the research work in this thesis is divided into two parts:

(1) Integrating an attention mechanism into the existing unsupervised image captioning algorithm. Current unsupervised image captioning algorithms encode the whole image, so no "attention" is paid to the individual salient regions as the sequence is processed step by step. We therefore improve the unsupervised image captioning framework by adding an attention mechanism. The generator is built from two LSTMs (referred to in the thesis as a Bi-LSTM): an Attention LSTM and a Language LSTM. At each time step, the low-level features extracted by the Faster R-CNN network and the output of the Attention LSTM are fed into the Language LSTM, so that each word prediction focuses on different areas of the image and the predicted words are associated with the salient regions.

(2) Incorporating zero-shot learning to improve the performance of the unsupervised image captioning algorithm. Most existing image captioning methods share a limitation: they cannot extend the set of object classes they recognize. Zero-shot object detection is introduced to correctly identify classes that appear during testing but not during training. We use the concept of meta-classes in semantic space: the unseen classes and background classes are grouped into a super-class, each unseen class is detected within the super-class, and the size of the super-class is progressively reduced. The performance of the algorithm is evaluated with the F1 score on the MS COCO dataset.
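The attention step in part (1) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not state how attention scores are computed, so plain dot-product scoring is assumed here, and the names `attend`, `region_features` and `query` are hypothetical. The `query` vector stands in for the Attention LSTM output at one time step, and each entry of `region_features` stands in for one Faster R-CNN region feature.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Soft attention over image regions (assumed dot-product scoring).

    region_features: list of equal-length feature vectors, one per region.
    query: vector standing in for the Attention LSTM output at this step.
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each region by its similarity to the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the regions.
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
weights, context = attend(regions, query)
```

In a model of the kind the abstract describes, a context vector like this would be passed to the Language LSTM together with the Attention LSTM output, so that a different query at each time step shifts the weights toward different regions.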
Keywords/Search Tags: Image captioning, Attention mechanism, Unsupervised learning, Zero-shot learning