
Attention-aware Deep Cross-modal Hashing

Posted on: 2021-04-15    Degree: Master    Type: Thesis
Country: China    Candidate: H L Yao    Full Text: PDF
GTID: 2428330602983751    Subject: Computer Science and Technology
Abstract/Summary:
From production to daily life, from manufacturing to services, and from industry to finance and commerce, the era of big data has arrived and is quietly changing the world. The Internet generates massive amounts of data every day and presents it in a variety of forms, such as text, images, video, and audio, which greatly enrich our lives. At the same time, the efficient storage and rapid retrieval of large-scale data has attracted wide attention as a very challenging task. Hashing methods map high-dimensional features into compact low-dimensional binary hash codes; because of their low storage consumption and fast retrieval speed, they have become important tools for large-scale cross-modal retrieval. With the rapid development of deep learning, its powerful feature-extraction ability has been applied in many fields: deep learning compensates for the limitations of hand-crafted feature extraction and yields abstract, effective feature representations. Therefore, many scholars have proposed models that combine deep learning with cross-modal hashing to improve retrieval performance.

However, many deep cross-modal hashing methods proposed in recent years still have shortcomings. Real-world data are generally imperfect and more or less redundant, which makes cross-modal retrieval challenging, yet most existing cross-modal hashing methods fail to deal with this redundancy and therefore perform unsatisfactorily on such datasets. Taking the image and text modalities as an example, many methods do not consider the richness of the image content in the original dataset: they simply feed the entire picture into the network to extract features and then learn hash codes, so they cannot focus on the key information of the image, and redundant parts such as the background interfere with the extraction of effective features. Similarly, the original text annotations contain a great deal of noise, so using such data directly may also harm the extraction of effective features. In addition, to improve performance, many deep methods introduce complex network components such as generative adversarial networks or LSTM networks, but the substantial increase in the number of parameters may lead to a substantial increase in time cost.

In view of the above problems, we propose a new deep cross-modal hashing method, TEACH (aTtEntion-Aware deep Cross-modal Hashing), which performs feature learning and hash-code learning simultaneously. Drawing on attention methods that are currently popular in the computer vision field, and exploiting the ability of the attention mechanism to select a specific subset of the input (or of the features), we introduce attention into our cross-modal hash retrieval model. Specifically, different attention modules are designed for samples of different modalities, so as to highlight the key parts and reduce the contribution of redundant, interfering terms in the retrieval task. Moreover, to avoid the sharp growth of training time caused by introducing complex mechanisms into deep network models, this thesis obtains the two local attention maps in a pre-training stage. The time complexity of this classification step is O(n), far less than the training time of deep hashing networks that use the similarity matrix as supervision. At the same time, the simple parameters of the classification network are a clear advantage compared with more complex models such as generative adversarial networks. Therefore, compared with some simple deep cross-modal hashing methods, the training and retrieval time of TEACH does not increase greatly.

To verify the effectiveness of the proposed model, extensive experiments have been conducted on three common benchmark datasets, i.e., MIRFlickr-25K, NUS-WIDE, and Wiki, comparing TEACH with current effective cross-modal hashing retrieval models. The results show that TEACH is effective and practical.
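The abstract describes the attention-aware idea only in prose. The following is a minimal sketch, assuming PyTorch, of how a lightweight classification head can produce a local attention map that down-weights redundant regions (e.g., image background) before pooling and hash-code learning; the module name, layer sizes, and pooling choice are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch (not the thesis code) of attention-aware hashing for the image modality,
# assuming PyTorch and hypothetical dimensions: a 1x1-conv classification head yields a
# class-activation-style map that serves as a local attention map over spatial positions.
import torch
import torch.nn as nn

class AttentionAwareImageHasher(nn.Module):
    def __init__(self, in_channels=512, num_classes=24, code_len=64):
        super().__init__()
        self.cls_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # pre-trained classifier
        self.hash_layer = nn.Linear(in_channels, code_len)                  # hash-code learning

    def forward(self, feat_map):                     # feat_map: (B, C, H, W) from a CNN backbone
        logits = self.cls_head(feat_map)             # (B, num_classes, H, W)
        attn = torch.sigmoid(logits.max(dim=1, keepdim=True).values)  # (B, 1, H, W) attention map
        weighted = feat_map * attn                   # suppress background, keep salient regions
        pooled = weighted.mean(dim=(2, 3))           # (B, C) attention-weighted pooling
        return torch.tanh(self.hash_layer(pooled))   # relaxed binary codes in (-1, 1)

# Usage: codes = AttentionAwareImageHasher()(torch.randn(8, 512, 7, 7))
# At retrieval time the binary codes would be torch.sign(codes).
```

A text-modality module would follow the same pattern with a gating vector over annotation features instead of a spatial map.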
Keywords/Search Tags:Learning to hash, Cross-modal retrieval, Attention mechanism, Approximate nearest neighbor retrieval, Deep learning