
Research On Referring Expression Comprehension Based On Semantic Context

Posted on: 2021-03-08 | Degree: Master | Type: Thesis
Country: China | Candidate: Y M Gao | Full Text: PDF
GTID: 2428330605474903 | Subject: Computer technology
Abstract/Summary:
Referring expression comprehension has become a popular research topic in recent years. It spans the two major fields of computer vision and natural language processing and has broad application prospects. Following the joint embedding framework, we extract visual and textual features, map them into a common embedding space, and compute a similarity score through a matching function to perform referring expression comprehension. Since semantic context is an important means by which humans locate and describe objects in a scene, this thesis studies referring expression comprehension from two aspects: textual semantic context and visual semantic context. The main research work is as follows:

(1) Existing methods cannot fully capture the semantic information of a referring expression because their text feature extraction does not express differences in word order or grammatical structure. To address this, we propose a referring expression comprehension method based on grammatical textual semantic context. First, we use the Stanford Parser to generate the grammatical parse tree of the referring expression and build a dynamic model through a computation graph generation algorithm. Then, a tree-structured Long Short-Term Memory (Tree-LSTM) network extracts the textual semantic context to enhance the text features at each node of the computation graph. Finally, the dynamic model performs similarity matching between the enhanced node text features and the visual features extracted by a Convolutional Neural Network (CNN). Experiments on the RefCOCOg dataset show that this method effectively exploits the textual semantic context contained in the syntactic structure, strengthens the expressive power of the text features, and detects the referred objects and their related objects in the image more accurately.

(2) Most existing methods lack an adequate information mapping between the image and the referring expression, so the text features in their different modules describe the expression insufficiently. To overcome this, a textual semantic context method for referring expression comprehension based on multi-modal relationships is proposed. The method extracts three kinds of features, subject, location, and relationship, from the visual and textual features, and extracts textual semantic context from the multi-modal relationship between low-level visual features and high-level phrase features to enhance the corresponding type of text feature. Finally, it matches each of the three kinds of visual features with its text features and combines the results into an overall similarity. Experimental results show that the textual semantic context derived from multi-modal relationships effectively represents the interaction between multi-modal features and enhances the text features; the method guides the alignment of cross-modal information and improves the accuracy of referring expression comprehension.

(3) Existing methods detect similar objects poorly because the visual semantics of each object are insufficient. To mitigate this, a referring expression comprehension method based on visual semantic context is proposed, focusing on enhancing both visual and textual features. On the one hand, visual semantic context is extracted with a co-attention mechanism so that the subject module pays more attention to attribute information. On the other hand, the visual feature of the relationship module is enhanced through the potential relationship between the object and its related objects. Experimental results demonstrate that the method effectively strengthens the visual features and the attribute information of the referring expression, and significantly improves referring expression comprehension performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
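As an illustration of the co-attention idea in (3), the following is a minimal PyTorch-style sketch and not the thesis's actual implementation: the class name, dimensions, and fusion layer are assumptions. Visual region features attend over the words of the expression, and the attended textual context is fused back into the region (subject) features as visual semantic context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionContext(nn.Module):
    """Illustrative co-attention sketch: regions attend over expression words
    and the attended context enhances the region (subject) features."""
    def __init__(self, vis_dim, txt_dim, hid_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)        # project region features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)        # project word features
        self.fuse = nn.Linear(vis_dim + hid_dim, vis_dim)  # hypothetical fusion layer

    def forward(self, region_feats, word_feats):
        # region_feats: (num_regions, vis_dim); word_feats: (num_words, txt_dim)
        v = self.vis_proj(region_feats)                    # (num_regions, hid_dim)
        t = self.txt_proj(word_feats)                      # (num_words, hid_dim)
        affinity = v @ t.t()                               # region-word affinity matrix
        attn = F.softmax(affinity, dim=1)                  # each region attends over words
        txt_context = attn @ t                             # (num_regions, hid_dim)
        # Fuse the attended textual context back into the visual features.
        return self.fuse(torch.cat([region_feats, txt_context], dim=1))
```

A full co-attention design would also run the symmetric direction, with words attending over regions to enhance the text side; this sketch shows only the visual direction described in (3).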
In conclusion, research on the following three aspects demonstrates that extracting semantic context can effectively improve the performance of referring expression comprehension as a multimodal task: enhancing text features by extracting grammatical textual semantic context alone, establishing a mapping between high-level textual semantics and low-level visual semantics, and extracting visual semantic context to complete the matching of high-level features between the text and visual modalities.
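To make the joint-embedding matching that underlies all three methods concrete, here is a minimal sketch under assumed feature dimensions; the two linear projections and the cosine scoring stand in for whatever encoders and matching function the thesis actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingMatcher(nn.Module):
    """Illustrative joint-embedding sketch: map visual and textual features
    into a common space and score candidate regions by cosine similarity."""
    def __init__(self, vis_dim=2048, txt_dim=1024, embed_dim=512):
        super().__init__()
        self.vis_fc = nn.Linear(vis_dim, embed_dim)  # CNN region features -> embedding
        self.txt_fc = nn.Linear(txt_dim, embed_dim)  # expression encoding -> embedding

    def forward(self, region_feats, expr_feat):
        # region_feats: (num_regions, vis_dim); expr_feat: (txt_dim,)
        v = F.normalize(self.vis_fc(region_feats), dim=-1)
        t = F.normalize(self.txt_fc(expr_feat), dim=-1)
        return v @ t                                 # one similarity score per region

# The region with the highest score is taken as the referred object:
# predicted_index = matcher(region_feats, expr_feat).argmax()
```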
Keywords/Search Tags: referring expression comprehension, semantic context extraction, attention mechanism, multimodal, convolutional neural network