Image paragraph captioning aims to automatically generate a descriptive paragraph for a given image. It is more challenging than the traditional image captioning task and is an emerging research topic in multimodal artificial intelligence. As the generation target expands from a single sentence to a multi-sentence paragraph, the model faces higher demands on organizing visual cues and constructing textual logic. In addition, the automatic generation of semantically rich paragraphs has broader application prospects. Current mainstream studies share two problems. First, the structure within the paragraph is ignored, which can easily lead to redundant and incoherent content. Second, relationships between image regions are neglected: the input image is modelled as an unstructured collection of regions, which is insufficient to capture the overall details and leads to incomplete descriptions. To this end, we propose to explicitly model paragraph structures and region relationships as trees and to introduce these tree structures into image paragraph captioning models. Specifically, our work is as follows. First, to address the lack of paragraph structure, we design a hierarchical construction method that builds tree structures from paragraphs; these trees are used as supervision signals. In addition, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD). S2TD models paragraph decoding as a top-down binary tree expansion: starting from the global image feature, each parent node is iteratively split into left and right child nodes, and the leaf nodes are decoded into sentences that form a coherent paragraph. Second, to address the lack of region relationship modelling, we design a heuristic construction method that builds region tree structures, which are fed to the model as guidance. We further propose a novel encoder network, called Tree Enhanced Encoder (TEE). Using the grouping results obtained from the region trees, TEE constrains the multi-head self-attention mechanism layer by layer, which yields a more comprehensive and accurate understanding of the image content. Experiments are conducted on the Image Paragraph Benchmark Dataset. Quantitative analysis and qualitative comparison verify the feasibility and effectiveness of the proposed methods. The experimental results show that introducing tree structures into image paragraph captioning models improves the quality of the generated paragraphs.
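
As an informal illustration of the two tree-based ideas summarized above, the following sketch shows (a) a top-down binary expansion in which a root state is recursively split into left and right children and the leaves are read off in order, and (b) a per-layer attention mask built from region groups. It is not the S2TD or TEE implementation. All names (`Node`, `expand`, `leaves`, `group_mask`, `toy_split`, `toy_should_split`) are hypothetical. The splitting and stopping rules are toy stand-ins for learned modules, and restricting attention to regions in the same group is only one assumed way of "constraining" self-attention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class Node:
    """One node of a hypothetical paragraph tree; leaves stand for sentences."""
    state: Vector
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def expand(
    state: Vector,
    split: Callable[[Vector], Tuple[Vector, Vector]],
    should_split: Callable[[Vector, int], bool],
    depth: int = 0,
) -> Node:
    """Top-down binary expansion: split a parent state into two child states."""
    node = Node(state)
    if should_split(state, depth):
        left_state, right_state = split(state)
        node.left = expand(left_state, split, should_split, depth + 1)
        node.right = expand(right_state, split, should_split, depth + 1)
    return node


def leaves(node: Node) -> List[Node]:
    """Collect leaf nodes left to right (the assumed sentence order)."""
    if node.left is None or node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


def group_mask(groups: List[List[int]], num_regions: int) -> List[List[bool]]:
    """mask[i][j] is True when regions i and j share a group (assumed rule)."""
    mask = [[False] * num_regions for _ in range(num_regions)]
    for group in groups:
        for i in group:
            for j in group:
                mask[i][j] = True
    return mask


# Toy stand-ins for learned modules (assumptions, for illustration only).
def toy_split(state: Vector) -> Tuple[Vector, Vector]:
    left = [0.5 * x for x in state]
    right = [x - l for x, l in zip(state, left)]
    return left, right


def toy_should_split(state: Vector, depth: int) -> bool:
    return depth < 2  # stop after two levels, giving four leaves


# Root state stands in for a global image feature.
root = expand([1.0, 2.0], toy_split, toy_should_split)
sentence_states = [leaf.state for leaf in leaves(root)]

# Assumed schedule: fine groups in the first layer, one coarse group after.
layer_groups = [[[0, 1], [2, 3]], [[0, 1, 2, 3]]]
layer_masks = [group_mask(groups, 4) for groups in layer_groups]
```

In a real model, each leaf state would be passed to a sentence decoder, and each layer's mask would be applied to the attention scores before the softmax. Both steps are omitted here to keep the sketch free of dependencies.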