Image captioning is the task of automatically generating fluent text that accurately describes the main content of a given image. It is a cross-modal task that requires both understanding and generation across image and text modalities, and research on it has significant theoretical and practical implications for multimodal retrieval, autonomous driving, and natural language generation. However, most image captioning research focuses on English, with far less attention given to Chinese, leaving considerable room for improvement in Chinese image captioning. This paper aims to address this gap by studying Chinese caption generation.

Most current image captioning algorithms use convolutional neural networks or visual Transformer structures to extract feature vectors from a given image; owing to the structural characteristics of both networks, these encoders suffer from high complexity and computational redundancy. Meanwhile, the decoders of these methods all use a single-round decoding structure for caption generation, an approach that suffers from inadequate decoding and error accumulation. To address these problems, this paper focuses on constructing Chinese-oriented image captioning algorithms. The main work and innovations include:

1. A wave-fusion-based visual multilayer perceptron model for image captioning is proposed. The encoder consists only of stacked multilayer perceptron and activation layers, which greatly simplifies the model structure and reduces the number of training parameters. At the same time, image patches are represented as wave functions, allowing the spatial information of the image to be aggregated dynamically and helping the model extract image features more efficiently. A language decoder based on a memory attention mechanism is also proposed, which effectively simplifies the computation of traditional attention mechanisms; the mechanism additionally tracks the overall characteristics of the dataset during training, balancing the influence of other samples to improve the quality of the generated captions. Experimental results show that the model achieves significant improvements over the baseline model on several compared metrics.

2. An image captioning model based on knowledge guidance and dynamic multi-round decoding is proposed. A dynamic multi-round decoding structure is constructed in which a judging mechanism decides whether to re-decode; this polishes the initially generated text vector, making full use of the image information while correcting previously generated errors. A semantic-enhanced attention mechanism is proposed to incorporate external linguistic knowledge into the decoding vector and guide the model's decoding process, enhancing the language generation capability of the decoder. Finally, experiments show that the quality of the generated text is further improved on several metrics that correlate highly with human judgment.
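The wave-function representation of image patches mentioned in contribution 1 can be illustrated with a toy sketch. This is not the thesis's actual encoder: the function names and the choice of taking the magnitude of the summed wave are illustrative assumptions. The idea shown is only that when each patch carries an amplitude and a phase, patches can reinforce or cancel one another during aggregation, which is what makes the spatial interaction dynamic rather than a fixed sum.

```python
import math
from typing import List

def wave_aggregate(amplitudes: List[float], phases: List[float]) -> float:
    """Toy sketch (not the thesis model): each patch j is treated as a
    wave a_j * e^{i*theta_j}. Patches are fused by summing the complex
    waves; the phase of each patch controls how strongly it reinforces
    or cancels the others."""
    real = sum(a * math.cos(t) for a, t in zip(amplitudes, phases))
    imag = sum(a * math.sin(t) for a, t in zip(amplitudes, phases))
    # Magnitude of the fused wave serves as the aggregated feature.
    return math.hypot(real, imag)

# Two in-phase patches reinforce each other ...
print(wave_aggregate([1.0, 1.0], [0.0, 0.0]))      # → 2.0
# ... while opposite phases cancel almost completely.
print(wave_aggregate([1.0, 1.0], [0.0, math.pi]))  # → ~0.0
```

In the actual model the amplitudes and phases would be learned from the patch features, but the cancellation behavior above is the property that plain amplitude-only pooling lacks.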
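The control flow of the dynamic multi-round decoding in contribution 2 can be sketched as a loop in which a judging mechanism scores the current draft and triggers re-decoding while the score is too low. This is a minimal sketch under stated assumptions: `decode`, `judge`, the threshold, and the round budget are hypothetical placeholders, not the thesis's actual components.

```python
from typing import Callable, List

def multi_round_decode(
    decode: Callable[[List[str]], List[str]],
    judge: Callable[[List[str]], float],
    threshold: float = 0.75,
    max_rounds: int = 3,
) -> List[str]:
    """Sketch of dynamic multi-round decoding (names are assumptions):
    generate an initial caption, then, while the judge's score is below
    the threshold and the round budget is not exhausted, re-decode
    ("polish") using the previous draft as input, so later rounds can
    correct errors made in earlier ones."""
    caption = decode([])          # round 1: initial draft from the image
    rounds = 1
    while judge(caption) < threshold and rounds < max_rounds:
        caption = decode(caption)  # polish the previous draft
        rounds += 1
    return caption

# Toy usage: each call to `decode` produces a longer, better draft,
# and the judge simply rewards longer captions.
drafts = [["a", "dog"], ["a", "dog", "runs"], ["a", "dog", "runs", "fast"]]
calls = {"n": 0}

def toy_decode(prev: List[str]) -> List[str]:
    out = drafts[calls["n"]]
    calls["n"] += 1
    return out

def toy_judge(caption: List[str]) -> float:
    return len(caption) / 4.0

result = multi_round_decode(toy_decode, toy_judge)
print(result)  # → ['a', 'dog', 'runs'] (stops once the score reaches 0.75)
```

The single-round decoders criticized in the abstract correspond to `max_rounds=1`: whatever errors appear in the first draft are final, which is the error-accumulation problem the judging mechanism is meant to mitigate.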