Paragraph image captioning aims to generate a descriptive paragraph for a given image. As an important research direction of cross-media intelligence, it connects two significant areas: computer vision and natural language processing. Progress on this task is important for bridging the semantic gap between images and text.

In recent years, thanks to the sequence modeling capability of the RNN (Recurrent Neural Network) family, hierarchical RNN decoders have been widely used in paragraph image captioning. However, the limitations of the RNN structure give such methods the following problems. First, because RNNs have limited ability to capture long-term information, they struggle to generate long paragraphs, and the generated paragraphs lack coherence. In addition, the serial structure of RNNs results in higher training time complexity and lower efficiency.

Inspired by the characteristics of the CNN (Convolutional Neural Network), we carry out the following work. We propose a paragraph decoder based on a fully convolutional network: a gated structure is incorporated into a hierarchical CNN decoder, which provides stronger long-term memory and the ability to train in parallel. We also propose a metric to measure the coherence of paragraphs. We conduct experiments on evaluation metrics, the coherence metric, training time complexity, and subjective analysis on the Stanford image-paragraph dataset, and conclude that our decoder improves the quality of the generated paragraphs.

To further enhance image comprehension, we propose Dual-CNN, a paragraph image captioning model that integrates regional attention, together with a metric for measuring the diversity of sentences in a paragraph. Through experiments on evaluation metrics, the diversity metric, regional attention analysis, and subjective analysis, Dual-CNN significantly improves performance on the paragraph image captioning task.
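The abstract does not detail the gated structure of the CNN decoder. As a minimal sketch, assuming a GLU-style gated convolution (the building block commonly used in gated CNN sequence models, not necessarily the exact design here), a causal gated 1-D convolution can be written as follows; all names are illustrative:

```python
import numpy as np


def gated_conv1d(x, W, V, b, c):
    """GLU-style causal gated 1-D convolution over a sequence.

    x: (T, d_in) input sequence; W, V: (k, d_in, d_out) kernels for the
    linear and gate paths; b, c: (d_out,) biases.
    Returns h = (x * W + b) ⊗ sigmoid(x * V + c), shape (T, d_out).
    """
    k, d_in, d_out = W.shape
    T = x.shape[0]
    # Left-pad with k-1 zero frames so position t only sees inputs <= t.
    xp = np.vstack([np.zeros((k - 1, d_in)), x])
    A = np.zeros((T, d_out))
    B = np.zeros((T, d_out))
    for t in range(T):
        window = xp[t:t + k]  # (k, d_in) receptive field at step t
        A[t] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
        B[t] = np.tensordot(window, V, axes=([0, 1], [0, 1])) + c
    # Sigmoid gate controls how much of the linear path passes through,
    # giving the decoder a learnable long-term memory mechanism.
    return A * (1.0 / (1.0 + np.exp(-B)))
```

Because each output position depends only on a fixed window of past inputs, all positions can be computed in parallel during training, unlike the serial recurrence of an RNN.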
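The abstract does not define the proposed diversity metric. As a purely hypothetical sketch of one way such a metric could be formulated (not the paper's actual definition), sentence diversity within a paragraph can be scored as the average pairwise Jaccard distance between sentence word sets:

```python
from itertools import combinations


def sentence_diversity(sentences):
    """Average pairwise Jaccard distance between sentence word sets.

    Hypothetical illustration only: 0.0 means all sentences share the
    same words; 1.0 means every pair of sentences is fully disjoint.
    """
    word_sets = [set(s.lower().split()) for s in sentences]
    if len(word_sets) < 2:
        return 0.0
    distances = []
    for a, b in combinations(word_sets, 2):
        union = a | b
        # Jaccard distance: 1 - |intersection| / |union|
        distances.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances)
```

A paragraph that repeats the same sentence scores 0.0, while a paragraph whose sentences share no words scores 1.0, so higher values indicate more varied descriptions.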