With the rapid growth of food data on the Internet, food-related retrieval technology is needed for users to obtain useful food information, and cross-modal recipe retrieval has gradually become an active research topic. Cross-modal recipe retrieval enables mutual retrieval between food images and recipes. Since food images and recipes belong to different modalities separated by a heterogeneity gap, their similarity cannot be measured directly. How to alleviate this heterogeneity gap has therefore become a central problem in cross-modal recipe retrieval.

Cross-modal recipe retrieval is generally realized in three steps. The first step extracts features from food images and recipes. The second step aligns the two modalities in a joint embedding space constructed from these features. The last step performs cross-modal learning, typically with a triplet loss, to achieve retrieval.

Existing studies can be divided into two categories. The first category uses an adversarial loss to map food images and recipes into a common space and compute their similarity; however, the features extracted by these methods are insufficient, which limits the quality of modality alignment. The second category mainly relies on a reconstruction loss: it assumes a fixed distribution and regenerates the feature distribution from the learned features to realize modality alignment. However, this only weakly aligns the newly generated features, so there is little interaction between the features of the two modalities. Motivated by these problems, we focus on the first two key steps of cross-modal recipe retrieval from the following two aspects.

1. We propose a cross-modal recipe retrieval model based on multimodal attention interaction. First, a convolutional neural network and a recurrent neural network extract the initial features of food images and recipes, respectively. Then, a stacked cross-modal attention network realizes the interaction between the two modalities, and a self-attention mechanism supplements the internal information of food images and recipes, so that both inter- and intra-modality features are captured sufficiently. Adversarial learning is further used to enhance the distribution consistency of the two modalities. Finally, a triplet loss function is used to perform cross-modal recipe retrieval. Experimental results show that this model substantially improves the performance of cross-modal recipe retrieval.

2. We propose a cross-modal recipe retrieval model based on multiple alignment with the triplet loss. First, a Vision Transformer extracts the features of food images, and a hierarchical Transformer encoder extracts the features of the ingredients and the cooking instructions, respectively. Then, the triplet loss aligns the food image features with the ingredient features and the instruction features separately, which optimizes the distances between food images and recipes in the joint embedding space and increases the feature interaction between the two modalities at the bottom level. Adversarial learning is then applied to enhance the consistency of the two modalities from a global perspective. Finally, the triplet loss is used for cross-modal recipe retrieval. Experimental results on a real dataset show that this model improves the performance of cross-modal recipe retrieval.

Minimal illustrative sketches of the core components of the two models (cross-modal attention interaction, the hierarchical recipe encoder, triplet-loss alignment, and adversarial modality alignment) are given below.
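In the first model, stacked cross-modal attention lets image features attend to recipe features (and vice versa), after which self-attention supplements intra-modality information. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the stacked cross-modal attention network; the number of blocks, feature dimensions, and residual structure are illustrative assumptions, not the exact configuration of the thesis.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """One cross-attention + self-attention block for a single modality.

    The query features first attend to the other modality (inter-modality
    interaction), then attend to themselves (intra-modality supplement).
    """
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # inter-modality: e.g. image region features querying recipe tokens
        x, _ = self.cross_attn(query_feats, context_feats, context_feats)
        query_feats = self.norm1(query_feats + x)
        # intra-modality: self-attention over the updated features
        x, _ = self.self_attn(query_feats, query_feats, query_feats)
        return self.norm2(query_feats + x)

# stacking two blocks per modality, as an illustrative configuration
img_blocks = nn.ModuleList([CrossModalAttentionBlock() for _ in range(2)])
rec_blocks = nn.ModuleList([CrossModalAttentionBlock() for _ in range(2)])

img_feats = torch.randn(4, 49, 512)   # e.g. CNN region features (B, regions, D)
rec_feats = torch.randn(4, 100, 512)  # e.g. RNN token features (B, tokens, D)
for ib, rb in zip(img_blocks, rec_blocks):
    # update both modalities simultaneously from each other's previous states
    img_feats, rec_feats = ib(img_feats, rec_feats), rb(rec_feats, img_feats)
```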
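In the second model, ingredients and instructions are each encoded with a hierarchical Transformer: a word-level encoder produces sentence embeddings, and a sentence-level encoder produces the component embedding. The sketch below is a minimal two-level version built from standard PyTorch Transformer layers; the vocabulary size, pooling by mean, layer counts, and omission of padding masks are simplifying assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalRecipeEncoder(nn.Module):
    """Two-level Transformer: encode words within each sentence, then encode
    the sequence of sentence embeddings (padding masks omitted for brevity)."""
    def __init__(self, vocab_size=30000, dim=512, heads=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.sent_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, tokens):
        # tokens: (B, num_sentences, sentence_length) integer word ids
        B, S, L = tokens.shape
        words = self.embed(tokens).view(B * S, L, -1)
        words = self.word_encoder(words)               # word-level encoding
        sent_emb = words.mean(dim=1).view(B, S, -1)    # pool words -> sentences
        sents = self.sent_encoder(sent_emb)            # sentence-level encoding
        return sents.mean(dim=1)                       # pooled component embedding

enc = HierarchicalRecipeEncoder()
ingredients = torch.randint(0, 30000, (4, 10, 12))     # (B, sentences, words)
ingredient_emb = enc(ingredients)                       # (4, 512)
```

The same encoder structure can be instantiated separately for ingredients and for cooking instructions, yielding the two recipe-side embeddings that are aligned with the image features.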
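Both models rely on a triplet loss to pull matching image-recipe pairs together and push non-matching pairs apart in the joint embedding space. The following is a minimal sketch assuming L2-normalized embeddings, cosine similarity, and in-batch negatives; the margin value is an illustrative assumption rather than the exact setting used in the thesis.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb, rec_emb, margin=0.3):
    """Bidirectional triplet loss with in-batch negatives.

    img_emb, rec_emb: (B, D) embeddings where row i of each tensor
    corresponds to the same food item (a matching image-recipe pair).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    rec_emb = F.normalize(rec_emb, dim=-1)

    sim = img_emb @ rec_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of matching pairs

    # image -> recipe direction: every other recipe in the batch is a negative
    cost_i2r = (margin + sim - pos).clamp(min=0)
    # recipe -> image direction: every other image in the batch is a negative
    cost_r2i = (margin + sim.t() - pos).clamp(min=0)

    # mask out the positive pairs on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2r = cost_i2r.masked_fill(mask, 0)
    cost_r2i = cost_r2i.masked_fill(mask, 0)

    return cost_i2r.mean() + cost_r2i.mean()
```

For the multiple-alignment scheme of the second model, the same loss can additionally be applied between the image embeddings and the separate ingredient and instruction embeddings, on top of the image-recipe alignment.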
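Both models also use adversarial learning to make the global distributions of image and recipe embeddings consistent: a discriminator tries to tell which modality an embedding came from, while the encoders are trained to fool it. A minimal sketch follows; the discriminator architecture and the two-step (discriminator/encoder) update are assumptions for illustration, not the exact training scheme of the thesis.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """Predicts whether an embedding comes from the image (1) or recipe (0) modality."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, emb):
        return self.net(emb).squeeze(-1)

def adversarial_losses(disc, img_emb, rec_emb):
    """Returns (discriminator loss, encoder loss) for modality alignment."""
    bce = nn.functional.binary_cross_entropy_with_logits
    ones = torch.ones(img_emb.size(0), device=img_emb.device)
    zeros = torch.zeros(rec_emb.size(0), device=rec_emb.device)

    # discriminator: distinguish the two modalities (detach so encoders are untouched)
    d_loss = bce(disc(img_emb.detach()), ones) + bce(disc(rec_emb.detach()), zeros)

    # encoders: flip the labels so the two distributions become indistinguishable
    g_loss = bce(disc(img_emb), zeros) + bce(disc(rec_emb), ones)
    return d_loss, g_loss
```

In training, the discriminator loss updates only the discriminator, while the encoder loss is added to the triplet loss when updating the two encoders, which encourages the distribution consistency described above.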