Image Paragraph Captioning Based On Tree Structures

Posted on: 2023-04-15

Degree: Master

Type: Thesis

Country: China

Candidate: Y H Shi

Full Text: PDF

GTID: 2568306914471704

Subject: Intelligent Science and Technology

Abstract/Summary:

Image paragraph captioning aims to automatically generate a descriptive paragraph for a given image. It is more challenging than the traditional image captioning task and is an emerging research topic in multimodal artificial intelligence. As the generation target expands from a single sentence to a multi-sentence paragraph, the model must be better at organizing visual cues and constructing textual logic. In addition, the automatic generation of semantically rich paragraphs has broader application prospects.

Current mainstream studies share two problems. First, the structure within the paragraph is ignored, which can easily lead to redundant and incoherent content. Second, relationships between image regions are neglected: the input image is modelled as an unstructured collection of regions, which is insufficient to capture the overall details and leads to incomplete descriptions. To this end, we propose to explicitly model paragraph structures and region relationships with tree structures, and we introduce these tree structures into image paragraph captioning models.

Specifically, our work is as follows. First, to address the lack of paragraph structure, we design a hierarchical construction method that builds tree structures from paragraphs; these trees are used as supervision signals. We also propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD). S2TD models paragraph decoding as a top-down binary tree expansion: starting from the global image feature, each parent node is iteratively split into left and right child nodes, and the leaf nodes are decoded into sentences that form a coherent paragraph. Second, to address the lack of region relationship modelling, we design a heuristic construction method that builds region tree structures, which are fed to the model as guidance. We further propose a novel encoder network, called Tree Enhanced Encoder (TEE). Using the region groupings obtained from the region trees, TEE constrains the multi-head self-attention mechanism layer by layer, which results in a more comprehensive and accurate understanding of the image content.

Experiments are conducted on the Image Paragraph Benchmark Dataset. Quantitative analysis and qualitative comparison verify the feasibility and effectiveness of the proposed methods. The experimental results show that introducing tree structures into the image paragraph captioning model improves the quality of the generated paragraphs.
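The two tree-based ideas in the abstract can be illustrated with a toy sketch. It is not the thesis implementation: the names `Node`, `toy_split`, `expand`, `leaves`, and `group_mask` are hypothetical, the split is fixed arithmetic standing in for a learned module, the tree is expanded to a fixed depth (the abstract does not say how splitting stops), and the attention constraint is assumed to take the form of a boolean mask that lets a region attend only to regions in its own group.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


class Node:
    """One node of a binary paragraph tree; leaves stand for sentences."""

    def __init__(self, state: Vector, depth: int) -> None:
        self.state = state
        self.depth = depth
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def toy_split(state: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for a learned split of a parent state."""
    left = [0.5 * x + 0.1 for x in state]
    right = [0.5 * x - 0.1 for x in state]
    return left, right


def expand(state: Vector, max_depth: int, depth: int = 0) -> Node:
    """Top-down expansion: split each parent into two children.

    Assumption: expansion stops at a fixed depth.
    """
    node = Node(state, depth)
    if depth < max_depth:
        left_state, right_state = toy_split(state)
        node.left = expand(left_state, max_depth, depth + 1)
        node.right = expand(right_state, max_depth, depth + 1)
    return node


def leaves(node: Node) -> List[Node]:
    """Collect leaves left to right; each would be decoded into a sentence."""
    if node.left is None or node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


def group_mask(groups: Sequence[Sequence[int]], num_regions: int) -> List[List[bool]]:
    """Boolean attention mask: region i may attend to j only within a group.

    Assumption: this is one possible way to constrain self-attention
    with groupings read off a region tree.
    """
    mask = [[False] * num_regions for _ in range(num_regions)]
    for group in groups:
        for i in group:
            for j in group:
                mask[i][j] = True
    return mask
```

For example, `leaves(expand([1.0, 2.0], max_depth=2))` yields four leaf nodes, one per sentence slot. Under the assumption that lower encoder layers use finer groups, a first layer could use `group_mask([[0, 1], [2, 3]], 4)` and a later layer `group_mask([[0, 1, 2, 3]], 4)`, so attention widens from subtrees to the whole image.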

Keywords/Search Tags:

multimodal artificial intelligence, deep learning, image paragraph captioning, tree structure

Related items

1	Research On Image Captioning Methods Based On Deep Learning
2	Research On Image Paragraph Captioning Method Based On Deep Learning
3	Paragraph Image Captioning Based On Convolutional Neural Network
4	Research And Application Of Image Paragraph Captioning Based On Relations Encoding And Attention Mechanism
5	Research On Academic Figure Captioning Based On Deep Learning
6	Deep Multimodal Attention Learning For Image Captioning
7	Image Captioning Methods Based On Fusion Learning Of Generative Model And Retrieval Model
8	Research On Visual Captioning Based On Deep Learning
9	Research On Video Captioning Based On Deep Learning
10	Research On Multimodal Emotion Recognition Based On Natural Language Characteristics