With the development of deep learning, artificial intelligence research in computer vision and natural language processing has moved from single tasks to complex multi-task settings. Recognizing and describing the complex semantics of visual content is a high-level task and a key step toward enabling computers to understand image content. Its goal is to identify visual relationships and generate descriptions for a given image. Visual relationship detection identifies the relationships between different objects in an image; image captioning generates a natural language description of the image. However, traditional image captioning ignores the art of language. To combine text with art, stylized image captioning has been derived: it requires not only a correct understanding of the visual content of an image but also a more creative description. In summary, based on the complex semantics of visual content, this thesis conducts the following research on visual relationship detection and image captioning:

(1) Research on Complex Visual Relationship Detection

For complex visual relationship detection, this thesis proposes Visual Relationship Detection Based on External Language (VRDEL), which combines internal vision and external language for visual relationship detection. Specifically, VRDEL consists of a visual module and a language module. The visual module uses an object detection model to identify objects in the image and predicts the probability distribution of the relationship between two objects from their contextual semantics and spatial locations in the visual domain. The language module uses a cross-modal retrieval model to build a language knowledge database linked to images, from which it mines and infers the probability distribution of relationships between objects in the semantic domain. Finally, the outputs of the visual and language modules are integrated with a late fusion approach.

(2) Research on Complex Image Captioning

For complex image captioning, and to explore the impact of visual relationships on captioning, this thesis proposes the Image Caption Network Based on Visual Relationship (ICNVR), which integrates visual relationships into image features to assist in automatically generating image descriptions. Specifically, ICNVR consists of a visual encoder and a language decoder. The visual encoder builds a scene graph with a visual relationship model and encodes it into visual features through a graph convolutional network; the language decoder uses a two-layer LSTM network to decode these features into image descriptions. To add emotion to image captions, this thesis further proposes the Multi-Style Image Caption Network Based on Transformer (MSICNT), which overcomes the inability of traditional image captioning to generate appropriate language styles that express subjective emotions. Specifically, MSICNT is based on the Transformer architecture and consists of three parts: an image caption network, a multi-modal style decoder, and a denoising auto-encoder. The image caption network generates factual descriptions that provide prior knowledge for stylized captioning; the multi-modal style decoder uses a cross-fusion mechanism to learn multi-modal features and generate appropriate stylized captions; and the denoising auto-encoder embeds stylized and factual captions into a shared feature space for metric learning, ensuring that stylized captions retain as much information related to the image content as possible.
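The late-fusion step in VRDEL can be illustrated with a minimal sketch, assuming each module outputs a probability distribution over the same set of relationship labels; the mixing weight `alpha` and the example label set are hypothetical, not taken from the thesis:

```python
def late_fusion(p_visual, p_language, alpha=0.5):
    """Blend the visual and language relationship distributions
    with a convex combination, then renormalize."""
    fused = [alpha * v + (1.0 - alpha) * l
             for v, l in zip(p_visual, p_language)]
    total = sum(fused)
    return [f / total for f in fused]

# Illustrative candidate relations, e.g. ("on", "near", "holding")
p_vis = [0.7, 0.2, 0.1]   # from the visual module
p_lang = [0.5, 0.4, 0.1]  # from the language knowledge database
fused = late_fusion(p_vis, p_lang)  # ≈ [0.6, 0.3, 0.1]
```

In a real system `alpha` would be tuned on a validation set; the fusion happens at the score level, so the two modules can be trained independently.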
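ICNVR's visual encoder propagates features over the scene graph with a graph convolutional network. A toy sketch of one propagation step, assuming a row-normalized adjacency matrix with self-loops, plain-list node features, and an illustrative (untrained) weight matrix:

```python
def gcn_layer(adj, feats, weight):
    """One graph-convolution step: aggregate neighbour features
    via `adj`, apply a linear projection `weight`, then ReLU."""
    n, d_in = len(feats), len(feats[0])
    d_out = len(weight[0])
    out = []
    for i in range(n):
        # Weighted sum of neighbour features (self-loop included in adj).
        agg = [sum(adj[i][j] * feats[j][k] for j in range(n))
               for k in range(d_in)]
        # Linear projection followed by ReLU.
        row = [max(0.0, sum(agg[k] * weight[k][m] for k in range(d_in)))
               for m in range(d_out)]
        out.append(row)
    return out

# Tiny scene graph: two connected nodes (e.g. "person" -- riding -- "horse").
adj = [[0.5, 0.5],
       [0.5, 0.5]]          # row-normalized, self-loops included
feats = [[1.0, 0.0],
         [0.0, 1.0]]        # one-hot node features for illustration
weight = [[1.0, 0.0],
          [0.0, 1.0]]       # identity projection for illustration
updated = gcn_layer(adj, feats, weight)
```

Stacking several such layers lets each object's feature absorb information from its relationship neighbours before the decoder sees it.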
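The metric-learning objective of MSICNT's denoising auto-encoder pulls a stylized caption's embedding toward the factual caption's embedding for the same image, so style is added without losing content. A minimal sketch of one such distance loss, assuming caption embeddings are plain vectors; the hinge form and the `margin` value are illustrative assumptions, not the thesis's exact loss:

```python
def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def content_retention_loss(stylized_emb, factual_emb, margin=1.0):
    """Hinge loss: penalize a stylized embedding only when it drifts
    farther than `margin` from the factual embedding."""
    return max(0.0, euclidean(stylized_emb, factual_emb) - margin)

# Example: a stylized caption far from its factual counterpart is penalized.
loss = content_retention_loss([0.0, 0.0], [3.0, 4.0])  # distance 5.0, loss 4.0
```

During training this term is added to the captioning loss, trading off stylistic freedom against fidelity to the image content.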