Image retrieval aims to provide people with convenient and fast image search. With the advent of the big-data era, the amount of information on the Internet is growing exponentially, and image resources account for a large proportion of it, so retrieving images accurately and efficiently from massive collections has become an important research topic. Most existing deep hashing methods use deep convolutional neural networks (CNNs) as the backbone. Although CNNs have many desirable properties, such as translation invariance and inductive bias, they also have a limitation: they require many stacked convolution operations to capture global information, which is one reason why mainstream CNN architectures keep growing more complex. However, repeated convolution discards many shallow details from the final feature map, reducing the accuracy of the generated binary hash codes, and complex network structures also increase model training time. The Transformer, a model that has achieved strong performance in computer vision within only a few years, can directly capture global relationships, enlarge the receptive field in relatively little time, and retain richer semantic feature information than deep CNNs, but its computational complexity is high. Moreover, most deep hashing networks ignore the importance of the classification layer. In response to these issues, this paper conducts research in the following two areas:

(1) This paper proposes a residual feature extraction network that combines the Transformer with a channel attention mechanism. The core algorithm of the Transformer is self-attention, which can capture the correlations between local and global features of an input image in a single encoding operation. However, for three-channel color images there are interdependencies between the features of different channels. To avoid wasting computing resources, this paper introduces a Channel Attention Mechanism (CAM) into the ViT model, assigning different weights to different channels. At the same time, to address the high computational complexity of the ViT model, a spatial reduction layer is designed to make the most of limited computing power. The experiments compare the proposed residual feature extraction network against AlexNet, ResNet-50, and VGGNet-16 on classification tasks on the CIFAR-10 and CIFAR-100 datasets. The results show that, compared with other attention models, the improved ViT model focuses on task-relevant local regions, strengthens the feature extraction ability of the network, and uses computing resources more effectively.

(2) In the network design, this paper uses a deep CNN to extract shallow features of the input image, and uses the designed residual feature extraction network that integrates the Transformer and channel attention as the deep feature extractor of the model. Considering the role of classification loss, a classification layer is added at the end of the network to learn the classification loss, so that the network preserves pairwise similarity while generating more accurate hash codes consistent with the sample labels. Ablation experiments show that the classification layer improves the retrieval performance of the model. Comparative experiments against other deep hashing methods were conducted on the CIFAR-10 and NUS-WIDE datasets, with mean average precision (mAP) as the evaluation metric. The results show that the mAP of the proposed method is higher than that of the other deep hashing methods, verifying its superior performance in image retrieval.
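To make the channel attention idea concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation-style channel attention block: global average pooling "squeezes" each channel to a scalar, a small two-layer bottleneck produces a per-channel weight in (0, 1), and each channel of the feature map is rescaled by its weight. The function name, the reduction ratio `r`, and the plain matrix weights are illustrative assumptions; the exact formulation used inside the proposed network may differ.

```python
import numpy as np

def channel_attention(feature_map, w1, w2):
    """SE-style channel attention over a (C, H, W) feature map.

    w1: (C//r, C) squeeze projection, w2: (C, C//r) excitation projection
    (both hypothetical learned weights for this sketch).
    """
    # Squeeze: global average pool each channel to one scalar -> (C,)
    squeeze = feature_map.mean(axis=(1, 2))
    # Excite: bottleneck MLP with ReLU then sigmoid -> per-channel weight in (0, 1)
    hidden = np.maximum(0.0, w1 @ squeeze)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Rescale: broadcast each channel's weight over its spatial positions
    return feature_map * weights[:, None, None]

# Toy example with random weights standing in for learned parameters
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = channel_attention(x, w1, w2)
```

Because the sigmoid keeps every channel weight strictly between 0 and 1, the block can only attenuate channels relative to one another, which is what lets the network emphasize task-relevant channels without changing the feature map's shape.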