Image similarity retrieval is an indispensable service on apparel e-commerce platforms. Most existing retrieval methods are based on the overall similarity of apparel images and pay no attention to similarity under specific attributes. For example, a red long-sleeved V-neck T-shirt and a blue short-sleeved V-neck T-shirt are not similar overall, but they are similar in neckline design. Attribute-aware search can better meet users' personalized search requirements and promote consumer spending. In this paper, we focus on attribute-aware image similarity learning methods for fine-grained image retrieval. Bridging the gap between visual feature encoding and attribute semantics is the core problem of fine-grained similarity retrieval. We decompose this problem into three basic tasks: attribute localization, feature extraction, and feature fusion. Attribute localization methods that use spatial attention to locate key regions of images have been widely studied and applied, but their accuracy still needs improvement. General techniques such as downsampling struggle to collect effective feature information from clothing images, and commonly used fusion operations such as concatenation cannot richly express multiple levels of feature information. To address these problems, this paper proposes a similarity learning network based on the attention mechanism (Attn Fashion) and an iterative localization similarity learning network (ISLN). We conduct extensive experiments on two public datasets in the field of clothing retrieval to verify the effectiveness of the models. In summary, this paper makes the following contributions:

(1) To address the problem of clothing attribute localization and the semantic gap between the image and text modalities, we propose the attention-based similarity learning network Attn Fashion. Attn Fashion has two core attention modules: attribute-guided spatial attention locates key
regions of images under the guidance of attributes, and attribute-guided channel attention extracts attribute information across multiple channels. Finally, an adaptive feature fusion mechanism fuses the key features extracted by the spatial and channel attention, so that the fused features carry richer semantic information.

(2) Inspired by the way human vision focuses on key locations of an image, we propose the iterative localization similarity learning network (ISLN). ISLN contains two core modules: the Whole module locates key regions of images and produces new feature maps, and the Part module fuses feature maps with attribute features to generate new attribute features. During the iterative process, ISLN fuses multiple levels of semantic information and continuously adjusts the attribute localization regions to optimize the extracted attribute features.

(3) We conducted extensive experiments to verify the performance of Attn Fashion and ISLN on public datasets in the fashion field. We also ran extensive comparison and ablation experiments to test the rationality of the model structures. The experiments show that Attn Fashion and ISLN outperform existing models: they effectively locate the key regions of images and extract attribute features.
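To make the Attn Fashion design concrete, the sketch below shows one plausible way to combine attribute-guided spatial attention, attribute-guided channel attention, and adaptive fusion in PyTorch. All layer shapes, gating choices, and module names here are our illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class AttrGuidedAttention(nn.Module):
    """Illustrative sketch of attribute-guided spatial/channel attention
    with adaptive fusion, in the spirit of Attn Fashion (layer sizes and
    gating formulas are assumptions, not the paper's exact design)."""

    def __init__(self, channels: int, attr_dim: int):
        super().__init__()
        # Project the attribute embedding into the image feature space.
        self.attr_proj = nn.Linear(attr_dim, channels)
        # Spatial branch: score each location against the attribute query.
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)
        # Channel branch: gate channels conditioned on the attribute.
        self.channel_fc = nn.Linear(channels, channels)
        # Adaptive fusion: learn per-sample weights for the two branches.
        self.fusion = nn.Linear(2 * channels, 2)

    def forward(self, feat: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image feature map; attr: (B, A) attribute embedding.
        q = self.attr_proj(attr)  # (B, C) attribute query

        # Attribute-guided spatial attention: weight locations by the query.
        guided = feat * q[:, :, None, None]                    # broadcast over H, W
        spatial_map = torch.sigmoid(self.spatial_conv(guided))  # (B, 1, H, W)
        spatial_feat = (feat * spatial_map).mean(dim=(2, 3))    # (B, C)

        # Attribute-guided channel attention: gate pooled channels by the query.
        pooled = feat.mean(dim=(2, 3))                          # (B, C)
        channel_gate = torch.sigmoid(self.channel_fc(pooled * q))
        channel_feat = pooled * channel_gate                    # (B, C)

        # Adaptive fusion: softmax weights over the two branch features.
        w = torch.softmax(
            self.fusion(torch.cat([spatial_feat, channel_feat], dim=1)), dim=1
        )
        return w[:, :1] * spatial_feat + w[:, 1:] * channel_feat
```

A call such as `AttrGuidedAttention(256, 64)(feat, attr)` would return one fused (B, 256) attribute feature per image, which could then feed a similarity loss such as triplet loss.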
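The Whole/Part iteration in ISLN can likewise be sketched as a loop that alternately refines the feature map and the attribute feature. Again, the layer choices and the number of iterations below are hypothetical:

```python
import torch
import torch.nn as nn


class ISLNSketch(nn.Module):
    """Illustrative Whole/Part iterative loop in the spirit of ISLN
    (layer choices and iteration count are assumptions)."""

    def __init__(self, channels: int, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        # Whole module: re-weights the feature map around the current
        # attribute estimate, producing a new feature map.
        self.whole = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Part module: fuses the refined map with the attribute feature
        # to produce a new attribute feature.
        self.part = nn.Linear(2 * channels, channels)

    def forward(self, feat: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image feature map; attr: (B, C) attribute feature.
        for _ in range(self.n_iters):
            # Whole: condition every location on the current attribute feature.
            attr_map = attr[:, :, None, None].expand_as(feat)
            feat = torch.relu(self.whole(torch.cat([feat, attr_map], dim=1)))
            # Part: fuse the pooled map summary with the attribute feature.
            pooled = feat.mean(dim=(2, 3))
            attr = torch.relu(self.part(torch.cat([pooled, attr], dim=1)))
        return attr
```

Each pass lets the network re-localize the attribute on a map already conditioned on the previous estimate, which is the mechanism by which ISLN accumulates multiple levels of semantic information across iterations.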