Research On Image-text Cross-modal Hash Retrieval Based On Semantic Preservation And Attention Mechanism

Posted on: 2024-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: J L Hong
Full Text: PDF
GTID: 2568307178473944
Subject: Computer technology
Abstract/Summary:
With the exponential growth of network devices and the development of mobile networks, a large amount of multi-modal data has emerged on the Internet, and different data modalities may share semantic correlations. As the volume of data increases, traditional cross-modal hash retrieval methods struggle to extract effective features, which degrades retrieval performance. To address this issue, methods based on deep cross-modal hash retrieval have recently received increasing attention. Many researchers have replaced traditional feature extraction with neural network models and have produced effective results. An analysis of current methods, however, still reveals several shortcomings: insufficient accuracy in feature extraction from multi-modal data, a mismatch between the similarity defined by multi-label annotations and the similarity of the hash codes generated by the hash function, and the difficulty of balancing model accuracy and efficiency. To address these issues, this thesis carries out the following work.

To address the measurement mismatch between similarity coefficients in cross-modal multi-label retrieval, an interval (margin) parameter is introduced to correct the bias. In addition, the Transformer architecture, which has shown excellent performance on a wide range of computer vision and natural language processing tasks, is introduced into cross-modal hash retrieval. A new supervised hashing method, Deep Semantics Preserving Vision Transformer Hashing (DSPVTH), is proposed. This method maps data of different modalities into binary hash codes using network structures such as the Vision Transformer, and preserves the semantic correlations between modalities through multi-label similarity relationships. The effectiveness and robustness of DSPVTH are demonstrated on four classic multi-modal image-text datasets, where its average precision is 2% to 8% higher than that of current state-of-the-art methods.

Although DSPVTH achieves high accuracy, its large number of parameters and heavy computation limit its efficiency in cross-modal hash retrieval. Therefore, a lightweight model is adopted for feature extraction, with the number of parameters and the amount of computation kept below the current benchmark level. In addition, to address the problem that lightweight pre-trained models ignore soft-label information in cross-modal hashing, a new intermediate-level feature extraction module is built on the middle layers of the network to integrate secondary but still important features into the hash representation. Combined with the interval-parameter strategy of the previous method, a new lightweight supervised hashing method, Lightweight Cross-modal Attention mechanism Hashing (LCAH), is proposed. LCAH recovers features that would otherwise be ignored in the middle layers and yields a better fused representation. It has a number of parameters similar to the current baseline method but lower computational complexity. The effectiveness of this method is verified on four classic benchmark datasets.
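To make the role of the interval parameter more concrete, the following is a minimal sketch of a margin-corrected pairwise loss that aligns hash-code similarity with multi-label similarity. It is not the thesis's actual formulation: the cosine-style label similarity, the function names, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multilabel_similarity(labels_a, labels_b):
    # Soft similarity between multi-label vectors: normalized label overlap
    # (cosine), instead of the usual binary "share at least one label" flag.
    return F.cosine_similarity(labels_a.unsqueeze(1), labels_b.unsqueeze(0), dim=-1)

def margin_corrected_hash_loss(codes_img, codes_txt, labels_img, labels_txt, margin=0.1):
    """Pairwise loss aligning hash-code similarity with multi-label similarity.

    codes_*: real-valued relaxations of binary codes in [-1, 1], shape (B, K).
    margin:  interval parameter that tolerates small deviations between the
             two similarity scales before penalizing them (assumed usage of
             the 'interval parameter' mentioned in the abstract).
    """
    K = codes_img.size(1)
    # Normalized inner product of relaxed codes, rescaled to [0, 1].
    code_sim = (codes_img @ codes_txt.t()) / K
    code_sim = (code_sim + 1.0) / 2.0
    label_sim = multilabel_similarity(labels_img, labels_txt)
    # Only deviations larger than the interval/margin are penalized.
    gap = (code_sim - label_sim).abs() - margin
    sim_loss = F.relu(gap).pow(2).mean()
    # Quantization loss pushing relaxed codes toward {-1, +1}.
    quant_loss = (codes_img.abs() - 1.0).pow(2).mean() + (codes_txt.abs() - 1.0).pow(2).mean()
    return sim_loss + 0.1 * quant_loss
```

At retrieval time the relaxed codes would be binarized with the sign function so that image and text codes can be compared in the same Hamming space.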
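For the second method, the intermediate-level feature module described in the abstract might look roughly like the sketch below: a lightweight channel-attention block re-weights mid-layer features and fuses them, through a residual projection, with the backbone's final features before the hash layer. The module name, the squeeze-and-excitation style attention, and all dimensions are assumptions for illustration, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class MidLevelAttentionFusion(nn.Module):
    """Hypothetical sketch of an intermediate-level attention fusion module."""

    def __init__(self, mid_dim, final_dim, hash_bits):
        super().__init__()
        # Channel attention over pooled mid-level features.
        self.attn = nn.Sequential(
            nn.Linear(mid_dim, mid_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(mid_dim // 4, mid_dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(mid_dim, final_dim)
        self.hash_layer = nn.Linear(final_dim, hash_bits)

    def forward(self, mid_feat, final_feat):
        # mid_feat:   (B, C, H, W) intermediate feature map from a lightweight backbone.
        # final_feat: (B, final_dim) globally pooled features from the last stage.
        pooled = mid_feat.mean(dim=(2, 3))          # global average pool -> (B, C)
        weighted = pooled * self.attn(pooled)       # attention re-weights mid-level channels
        fused = final_feat + self.proj(weighted)    # residual fusion with final features
        return torch.tanh(self.hash_layer(fused))   # relaxed binary codes in (-1, 1)
```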
Keywords/Search Tags:cross-modal hashing, semantics preserving, attention mechanism, supervised learning, deep learning