With the rapid development of the mobile Internet era, the amount of multimedia content on the Internet has exploded. As one of the most commonly used information media for human communication, images can intuitively express rich visual semantics. Image captioning aims to enable computers to understand the content of images as humans do, and to generate corresponding textual descriptions with natural language generation techniques. By bridging the semantic gap between images and text, image captioning plays a key role in fields such as information retrieval and human-computer interaction. Today, most image captioning methods use deep generative models to flexibly generate text descriptions that match the semantics of images. However, relying solely on generative models still suffers from problems such as meaningless output, logical errors, syntactic errors, and difficulty in modeling long sequences. Retrieval-based methods, by contrast, can produce descriptions that are rich in detail and grammatically correct; however, because they lack flexibility, they cannot produce the caption that best matches the content of a given image. To this end, this thesis explores how to effectively combine the advantages of generative and retrieval-based methods to improve the performance of image captioning. The main research contributions of this thesis are as follows:

1. An image sentence captioning method based on the fused learning of a retrieval model and a generative model. To generate semantically accurate descriptions of salient image regions, this thesis adopts a retrieval model based on image semantic similarity and proposes a language generation model that incorporates the retrieved knowledge. To further exploit this knowledge, a copy mechanism is devised to introduce relevant words from the retrieval results into the text generation distribution. To further improve semantic accuracy, an interactive dual adversarial training mechanism is proposed to better fuse the retrieval ranking discriminator and the caption generator.

2. An image paragraph captioning method based on the hierarchical fused learning of a retrieval knowledge reasoning model and a generative model. To overcome the meaningless and irrelevant long descriptions often produced by language models, this thesis constructs a scene graph for image paragraph captioning. By introducing the semantics of the retrieved knowledge, the auxiliary model performs hierarchical sentence-level topic modeling and word-level semantic modeling, and generates accurate and fluent long descriptions guided by related descriptions and a knowledge-triple topic model. In addition, this thesis builds a Chinese image paragraph captioning dataset.

3. A visual aid system for image captioning. To meet the public's need for image captioning as a visual aid service, this thesis develops an online image captioning system that integrates the aforementioned algorithms. The system applies the retrieval-and-generation fused learning scheme to describe the semantics of an image. Furthermore, it provides controllable generation tools and visualizes the generation results for the user.
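To illustrate the general idea behind a copy mechanism that injects retrieved words into the generation distribution, the following is a minimal sketch in the style of a pointer-generator mixture. The function name, argument names, and the simple linear mixing are illustrative assumptions; the thesis's exact formulation (e.g. how the gating probability and copy weights are computed by the network) may differ.

```python
import numpy as np

def copy_augmented_distribution(p_vocab, copy_token_ids, copy_weights, p_gen):
    """Mix the generator's vocabulary distribution with a copy distribution
    over tokens appearing in the retrieved captions (pointer-generator style).

    p_vocab        : np.ndarray, generator's distribution over the vocabulary
                     (sums to 1)
    copy_token_ids : vocabulary ids of tokens found in the retrieval results
    copy_weights   : attention-like weights over those tokens (sum to 1)
    p_gen          : scalar in [0, 1], probability mass kept for generation
    """
    # Scale the generative distribution by the generation gate p_gen.
    mixed = p_gen * p_vocab.copy()
    # Route the remaining (1 - p_gen) mass to the retrieved tokens,
    # proportionally to their copy weights.
    for tok, w in zip(copy_token_ids, copy_weights):
        mixed[tok] += (1.0 - p_gen) * w
    return mixed  # still a valid probability distribution

# Toy usage: a 3-word vocabulary, with word id 2 found in a retrieved caption.
p = copy_augmented_distribution(
    p_vocab=np.array([0.5, 0.3, 0.2]),
    copy_token_ids=[2],
    copy_weights=[1.0],
    p_gen=0.8,
)
```

In this toy case the retrieved word (id 2) receives the full copy mass, so its probability rises from 0.2 to 0.8 × 0.2 + 0.2 × 1.0 = 0.36 while the result remains normalized. In practice the copy weights would come from an attention distribution over the retrieved caption tokens.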