
Visual-semantic Alignment For Object Localization

Posted on: 2020-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: C J Yang
Full Text: PDF
GTID: 2428330590974106
Subject: Control Engineering
Abstract/Summary:
Deep learning has greatly improved the performance of object recognition and object detection, but what these models basically do is align visual information with a finite set of semantic symbols. This thesis aims to move a step further toward general artificial intelligence by aligning visual information with free-form language descriptions, so that an agent can understand arbitrary referring expressions and localize the described object region in its visually perceived data, i.e., the image, just as humans do. The task lies at the intersection of computer vision and natural language processing: it takes an image and a text as inputs and outputs the region occupied by the target object. Given the complexity of the recognition involved, this thesis employs a deep neural network, which can learn to predict the ground truth well through training on a large dataset. The network is composed of a visual subnet, which learns visual features from the image input; a semantic subnet, which learns language features from the text input; and an alignment subnet, which computes the correlation between the two kinds of features and localizes the best-matching target region.

Within this framework, the thesis diagnoses the visual features learned by popular feedforward convolutional networks and shows that they lack both the semantic information and the spatial resolution the task requires, which damages the alignment with language information and degrades accuracy. To address this problem, the thesis proposes a feature fusion scheme that fuses features layer by layer in a top-down manner. The resulting features represent visual information more completely and accurately, thus improving alignment performance. In addition, the thesis models the context of each word in the text, which helps to resolve the inherent ambiguity of language; this context-adaptive language representation further improves the alignment results. Ablation experiments demonstrate the clear benefits of both featuring methods.

The most important contribution of this thesis lies in the design of the alignment subnet. To motivate it, the thesis analyzes the weakness of holistic alignment methods, which fail to exploit the contextual information around objects, and discusses the language variance problem faced by existing modular alignment methods. To overcome these difficulties, the thesis proposes an adaptive modular alignment method. The alignment model consists of three parallel modules, an inner module, a pair module, and a global module, which compute alignment scores according to the object's attributes, its pairwise relationships with surrounding objects, and its global relationship with the overall scene, respectively. Moreover, the model adaptively adjusts the modular weights to handle language variance and can comprehend arbitrary expressions without imposing structural restrictions. This modular decomposition effectively exploits the contextual image regions of objects while reducing the overall semantic complexity, both of which boost the general performance. Experiments demonstrate its effectiveness and superiority.
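The adaptive modular alignment can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual network: the bilinear scoring form, the layer dimensions, and the weight head are all illustrative assumptions. Three parallel modules (inner, pair, global) each score how well a candidate region matches the expression from their own feature view, a weight head maps the language feature to per-expression module weights, and the final score is the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AdaptiveModularAlignment:
    """Illustrative sketch (assumed layers, not the thesis's exact model).

    Each module scores a region against the expression with a bilinear
    form v @ W @ l; the weight head predicts three adaptive module
    weights from the language feature, so different expressions can
    emphasize attributes, pairwise context, or global scene context.
    """

    def __init__(self, vis_dim=64, lang_dim=64):
        # One bilinear matrix per module: score = v @ W @ l
        self.W = {m: rng.standard_normal((vis_dim, lang_dim)) * 0.01
                  for m in ("inner", "pair", "global")}
        # Linear head predicting 3 adaptive module weights from language.
        self.Wh = rng.standard_normal((lang_dim, 3)) * 0.01

    def score(self, feats, lang):
        # feats: dict, module name -> (N, vis_dim) candidate-region features
        # lang:  (lang_dim,) expression-level language feature
        per_module = np.stack(
            [feats[m] @ self.W[m] @ lang for m in ("inner", "pair", "global")],
            axis=1)                       # (N, 3) per-module scores
        w = softmax(lang @ self.Wh)       # (3,) adaptive weights, sum to 1
        return per_module @ w             # (N,) fused alignment scores

# Toy usage: rank 5 candidate regions for one expression.
model = AdaptiveModularAlignment()
feats = {m: rng.standard_normal((5, 64)) for m in ("inner", "pair", "global")}
lang = rng.standard_normal(64)
fused = model.score(feats, lang)
best = int(fused.argmax())  # index of the best-matching region
```

Because the weights come from the language feature alone, an expression like "the red cup" can lean on the inner module while "the cup left of the plate" leans on the pair module, without any hand-imposed parse structure.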
Keywords/Search Tags: visual grounding, visual-semantic alignment, modular network, feature fusion, context-related language representation