| Image-text matching measures the correlation between an image and a text. It is an important part of cross-modal retrieval and has potential applications in search engines, e-commerce recommendation, and content understanding, and it has become a cutting-edge topic at the intersection of computer vision, natural language processing, and applied mathematics. This paper focuses on transforming image features into multiple text labels so that their similarity to the keywords representing the target text can be computed. Based on a multi-level graph convolution network structure, an image-text matching system built on a fine-grained multi-label image classification algorithm is proposed. The main contents are as follows. First, to mine the latent relationships between image label categories, we choose ML-GCN (Multi-Label Graph Convolution Network) as the initial multi-label image classification model; it uses graph convolution to learn from the co-occurrence probabilities of image labels and the embedding representations of the category labels. On this basis, the hierarchical information of the text label categories is used to build a domain semantic hierarchy tree and the corresponding multi-level coding. Each level of the tree is treated as an independent multi-label classification sub-task. At each level, a graph convolution module likewise learns the associations between the semantic category labels of that level, and the predictions of the upper level are concatenated with the compressed image features as the input to the next level. The resulting model, Hier-GCN, uses hierarchical classification of multi-level semantic tags as an auxiliary task, which effectively improves multi-label classification of the original image. Second, building on the multi-level hierarchical graph convolution classifiers and targeting practical application scenarios, prior information from a knowledge graph is introduced. KGE (Knowledge Graph Embedding), which is not affected by the timeliness of the corpus used to train the label embeddings, is used to obtain relatively stable word vector representations; this yields the model KG-Hier-GCN and makes the overall performance of the system more stable. An attention mechanism that can mine fine-grained features is used, and an adaptive weighted loss corresponding to the ground-truth labels is designed, to improve model performance and to accelerate loss convergence and network learning. Finally, the degree of matching between image and text is evaluated by computing the similarity between the output of multi-label image classification and the core keywords extracted from the target text, and by checking whether the image tags hit the paragraph or title of the target text. The method achieves an mAP of 0.8413 on the open-source multi-label image classification dataset COCO and 0.9461 on the VOC dataset. On the TencentEntertainment test set, drawn from the actual application scenario, it achieves a hit-1 of 0.4130 and a hit-10 of 0.4049. The system can serve as effective support for the image-text matching task and can also provide general auxiliary information for image search. |
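
The abstract says the initial model learns from the co-occurrence probability of image labels. As a minimal sketch of what such a statistic can look like, the snippet below estimates the conditional probability that label j appears given that label i appears, counted over per-image label sets. The function name `cooccurrence_probabilities` and the toy annotations are hypothetical and serve only as illustration; any thresholding or re-weighting that a graph convolution model might apply to this matrix is omitted, and this is not presented as the paper's exact procedure.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_probabilities(label_sets):
    """Estimate P(label_j present | label_i present) from per-image label sets."""
    label_count = defaultdict(int)  # number of images that contain label i
    pair_count = defaultdict(int)   # number of images that contain both i and j
    for labels in label_sets:
        unique = set(labels)
        for label in unique:
            label_count[label] += 1
        # Ordered pairs, because P(j | i) is generally not equal to P(i | j).
        for i, j in permutations(unique, 2):
            pair_count[(i, j)] += 1
    return {
        (i, j): count / label_count[i]
        for (i, j), count in pair_count.items()
    }


# Toy annotations (hypothetical, not taken from COCO or VOC).
images = [
    {"person", "surfboard"},
    {"person", "dog"},
    {"person", "surfboard", "sea"},
    {"dog"},
]
probs = cooccurrence_probabilities(images)
print(probs[("surfboard", "person")])  # 1.0: every surfboard image also has a person
print(probs[("person", "surfboard")])  # 2 of the 3 person images have a surfboard
```

The resulting matrix is asymmetric, which is why ordered pairs are counted: in the toy data a surfboard always implies a person, while a person implies a surfboard in only two of three images.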