Image captioning is an important research direction in artificial intelligence that aims to enable computers to understand images and generate accurate natural language descriptions. It is a multimodal generation task that converts the visual content of an image into a textual description, typically with an encoder-decoder architecture. The task requires accurately describing image content by detecting the objects present in the image and understanding the semantic relationships among them. Image captioning is widely used in image annotation, visual navigation, children's education, and assistance for visually impaired people.

Current image captioning models pay little attention to the positional information of objects in scene images, which makes it difficult to generate accurate and complete textual descriptions. Moreover, the time and storage costs of training such multimodal models are high. This thesis therefore combines a graph convolution structure, a Transformer structure, and a depthwise separable convolution structure, exploiting the strengths of all three so that the model generates captions faster and more accurately. The thesis studies image caption generation that fuses the positional relationships of objects in the scene, with the following two contributions:

1. Existing methods pay little attention to the implicit positional information in images, which makes it difficult to describe the positional relationships of objects accurately. To address this, this thesis proposes an image captioning method based on the positional relationships of objects in the scene. First, a position relationship encoder performs an initial encoding of the node features of the object relationship scene graph; the encoder contains a secondary encoding mechanism that computes the degree of imbalance between objects and, combined with a common-sense dictionary and an inference module, re-encodes the object relationship nodes according to this degree. Second, a joint decoder that combines an erasure module with a bias gating mechanism processes the encoded information and further optimizes the node features of the scene graph. Finally, the model generates the textual description corresponding to the image. The proposed method is validated on a public dataset, improves over existing methods on a range of evaluation metrics, and achieves notable results on the CIDEr metric.
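The abstract describes the secondary encoding mechanism only at a high level. As a minimal sketch of how such a mechanism could look, the PyTorch fragment below derives a pairwise "imbalance degree" from bounding-box geometry (size ratio and center distance) and uses it to gate a second encoding pass over the node features. Every name here (imbalance_degree, SecondaryEncoder, the gating formulation) is an illustrative assumption, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

def imbalance_degree(boxes: torch.Tensor) -> torch.Tensor:
    """Hypothetical pairwise imbalance score from bounding boxes.

    boxes: (N, 4) tensor of (x1, y1, x2, y2) coordinates.
    Returns an (N, N) matrix combining size imbalance and center distance,
    so that very differently sized or distant object pairs score high.
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # (N,)
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2                        # (N, 2)
    area_ratio = areas[:, None] / (areas[None, :] + 1e-6)              # (N, N)
    # Symmetric size imbalance: magnitude of the log area ratio.
    size_term = area_ratio.clamp(min=1e-6).log().abs()
    dist_term = torch.cdist(centers, centers)                          # (N, N)
    return size_term + dist_term / (dist_term.max() + 1e-6)

class SecondaryEncoder(nn.Module):
    """Re-encodes relationship-node features, gated by the imbalance degree."""

    def __init__(self, dim: int):
        super().__init__()
        self.reencode = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim + 1, 1)

    def forward(self, nodes: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) node features from the first encoding pass.
        deg = imbalance_degree(boxes).mean(dim=1, keepdim=True)        # (N, 1)
        g = torch.sigmoid(self.gate(torch.cat([nodes, deg], dim=-1)))  # (N, 1)
        # Blend first-pass features with the re-encoded ones.
        return g * torch.tanh(self.reencode(nodes)) + (1 - g) * nodes
```

Under this reading, nodes belonging to strongly imbalanced object pairs are re-encoded more aggressively, while balanced pairs stay close to their first-pass features.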
2. Existing image captioning methods often have too many model parameters, which drives up training costs. To address this, this thesis proposes an image captioning method based on a lightweight convolutional structure. First, a lightweight depthwise separable convolutional network extracts image features; during the initial encoding, a lightweight encoder that fuses object position relationships encodes the visual features of the image; a joint Transformer decoder then decodes the corresponding textual description from the lightweight visual features and the global image semantic features. The proposed method is validated on a public dataset and achieves notable results on the CIDEr metric, reducing both the number of model parameters and the time cost of generating text descriptions without compromising model performance.
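The abstract does not specify the layer configuration of the lightweight network. As a self-contained illustration of the depthwise separable convolution structure it builds on, the sketch below factorizes a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 pointwise convolution, which is the standard way this structure cuts parameter count; the channel sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel spatial convolution
    followed by a 1x1 pointwise convolution. For a k x k kernel with C_in
    input and C_out output channels, this costs roughly
    k*k*C_in + C_in*C_out weights instead of k*k*C_in*C_out."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(
            c_in, c_in, kernel_size,
            padding=kernel_size // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A standard 3x3 conv from 256 to 256 channels has ~590K weights;
# the separable version below has ~68K (channel sizes are illustrative).
block = DepthwiseSeparableConv(256, 256)
features = block(torch.randn(1, 256, 14, 14))
```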