
Semantic Analysis For Cross-Media Data

Posted on: 2020-10-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W F Zhang    Full Text: PDF
GTID: 1368330605466654    Subject: Computer Science and Technology
Abstract/Summary:
Cross-media data is defined as data that comes from different modalities and different viewpoints but describes the same concept or event. With the rapid development of information technology and the mobile internet, cross-media data such as images, text, audio, and video is booming and has been constantly changing the way people live and work. Understanding the semantics of cross-media data with artificial intelligence techniques has therefore become a research focus. Semantic analysis of cross-media data is the basis of various cross-media applications, including Automatic Image Annotation (AIA), Cross-Media Information Retrieval (CMIR), and Visual Question Answering (VQA).

There are still major challenges in cross-media semantic analysis, mainly: (1) the "semantic gap" between the low-level features and the high-level semantics of each modality; and (2) the "heterogeneous gap", which leads to inconsistent distributions and representations of data from different modalities. These gaps hinder the analysis and application of cross-media data. How to model the semantic correlation between different modalities is the key issue in cross-media semantic analysis. One typical solution is cross-media unified representation, which builds a common semantic space in which Euclidean or cosine distance can be used to measure the similarity between data from different modalities. However, traditional cross-media unified representation methods use global features, so they cannot match fine-grained information between modalities and they introduce noise. Another kind of solution is based on feature fusion, which fuses multimodal features and mines the semantic correlations between modalities. The key to feature fusion is capturing the complicated correlations between modalities, which still needs further investigation.

Targeting cross-media data, mainly image and text data, this thesis addresses two issues in understanding the semantics of cross-media data: the semantic gap between the visual features and the semantics of images, and the heterogeneous gap between cross-media data. By leveraging the strong learning ability of deep neural networks, we aim to mine fine-grained correlation information between modalities to improve the performance of cross-media semantic analysis. The main contributions include:

1. We propose a framework based on semantic learning to rank to enhance the semantics of visual features, and apply it to the image auto-annotation problem. Cross-media semantic enhancement aims to find an effective mapping based on the correlation between visual features and textual features. Guided by the discriminative textual features, applying this mapping improves the cluttered distribution of the visual features. We design a simple but effective neural ranking model to realize this mapping. Unlike typical learning-to-rank algorithms for image auto-annotation, which directly rank annotations for an image, our approach consists of two phases. In the first phase, neural ranking models are trained to rank an image's semantic neighbors. Then nearest-neighbor-based models propagate annotations from these semantic neighbors to the image. Thus our approach integrates learning-to-rank algorithms with nearest-neighbor-based models, including TagProp and 2PKNN, and inherits their advantages. To evaluate the proposed model, we conduct extensive experiments on four popular datasets. Experimental results show that our method effectively alleviates the gap between visual features and image semantics and improves image annotation performance.
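As a minimal illustration of the two-phase idea (not the dissertation's exact models), the Python sketch below uses a plain dot product as a stand-in for the trained neural ranker in phase 1 and a simplified TagProp/2PKNN-style weighted vote for phase 2; the function names and the choice K=5 are illustrative assumptions.

```python
# Sketch of two-phase image auto-annotation: phase 1 ranks an image's semantic
# neighbors with a scoring function, phase 2 propagates tags from the
# top-ranked neighbors via a weighted vote (TagProp/2PKNN-like, simplified).
import numpy as np

def rank_neighbors(query_feat, neighbor_feats, score_fn):
    """Phase 1: score candidate neighbors for the query image (higher = more relevant)."""
    scores = np.array([score_fn(query_feat, n) for n in neighbor_feats])
    order = np.argsort(-scores)          # neighbor indices by descending score
    return order, scores

def propagate_tags(order, scores, neighbor_tags, num_tags, K=5):
    """Phase 2: weighted vote over the K top-ranked neighbors' tag sets."""
    tag_scores = np.zeros(num_tags)
    topk = order[:K]
    weights = np.exp(scores[topk])       # soft weights derived from ranking scores
    weights /= weights.sum()
    for w, idx in zip(weights, topk):
        for t in neighbor_tags[idx]:     # each neighbor contributes its own tags
            tag_scores[t] += w
    return tag_scores                    # rank tags by this score to annotate the image

# Usage with random features and a dot-product stand-in for the learned ranker.
rng = np.random.default_rng(0)
query = rng.normal(size=128)
neighbors = rng.normal(size=(50, 128))
tags = [rng.choice(20, size=3, replace=False) for _ in range(50)]
order, scores = rank_neighbors(query, neighbors, lambda q, n: q @ n)
print(propagate_tags(order, scores, tags, num_tags=20))
```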
2. We propose a text representation based on textual relationships, and on top of it a novel cross-media retrieval model named SCANet. Most existing work represents cross-media data as "flat" features for both irregularly structured text and grid-structured data (e.g., images lie on 2D grids), which ignores important inherent explicit or implicit relational information. In this dissertation, we adopt graph models to represent text by integrating word-level semantic relationships. We construct globally shared graph structures and context-specific graph features. Based on these graph representations, we utilize Graph Convolutional Networks to generate relation-aware textual representations (see the text-graph sketch below). This text representation is used in SCANet. Furthermore, we propose a stacked co-attention network to progressively learn the mutually attended features of different modalities and enhance their fine-grained correlations. These correlations help our model achieve fine-grained semantic alignment in the common semantic space. In addition, metric learning is adopted to learn a distance metric between image and text representations in the common semantic space. Experimental results on five popular datasets demonstrate that SCANet can effectively alleviate the heterogeneous gap between cross-media data.

3. We propose a cross-media fusion approach based on visual relational reasoning and attention, which is used to address the CMIR and VQA tasks (see the fusion sketch below). The proposed cross-media fusion method captures the complicated correlation between visual and textual features, resulting in discriminative fused features for downstream tasks. Our cross-media fusion approach is composed of a visual relational reasoning module and a visual attention module. The visual relational reasoning module can not only reason about the relationship between two objects but also mine the relationships among multiple objects. The visual attention module, based on a bilinear model, achieves fine-grained feature interactions. The visual attention module enhances the objects related to the question, and the visual relational reasoning module mines the relationships between objects and fuses them into the final image representation. By combining these two modules, our VQA and CMIR models effectively alleviate the heterogeneous gap between cross-media data.
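A minimal sketch of the graph-based text representation idea from contribution 2, assuming a generic single-layer Graph Convolutional Network over word nodes whose edges encode word-level semantic relations; this is not the exact SCANet text branch, and the adjacency matrix, dimensions, and mean pooling are illustrative assumptions.

```python
# Text-graph sketch: words are nodes, edges are word-level semantic relations,
# and one GCN layer H' = ReLU(D^-1/2 (A+I) D^-1/2 H W) mixes each word's
# embedding with those of its related words.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt             # symmetric normalization
        return torch.relu(A_norm @ self.linear(H))           # relation-aware word features

# Usage: 6 words with 300-d embeddings, one illustrative word-relation edge.
H = torch.randn(6, 300)                      # word embeddings (one row per word)
A = torch.zeros(6, 6)
A[0, 1] = A[1, 0] = 1.0                      # e.g., an edge between related words 0 and 1
text_repr = GCNLayer(300, 256)(H, A).mean(dim=0)   # pooled relation-aware text vector
print(text_repr.shape)                       # torch.Size([256])
```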
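A minimal sketch of the two fusion components in contribution 3 under stated assumptions: Relation-Network-style pairwise reasoning over detected object features, and a simple question-guided soft attention standing in for the bilinear attention module; the module and variable names are illustrative, not the dissertation's implementation.

```python
# Fusion sketch: question-guided attention over objects plus pairwise
# relational reasoning over all object pairs, concatenated into one fused feature.
import torch
import torch.nn as nn

class RelationalAttentionFusion(nn.Module):
    def __init__(self, obj_dim, q_dim, hid_dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hid_dim), nn.ReLU())  # pairwise reasoning
        self.att = nn.Linear(obj_dim + q_dim, 1)                                    # question-guided attention

    def forward(self, objs, q):                 # objs: (N, obj_dim), q: (q_dim,)
        N = objs.size(0)
        qN = q.expand(N, -1)
        # Visual attention: weight objects by their relevance to the question.
        alpha = torch.softmax(self.att(torch.cat([objs, qN], dim=1)), dim=0)
        attended = (alpha * objs).sum(dim=0)
        # Visual relational reasoning: aggregate over all object pairs, conditioned on the question.
        oi = objs.unsqueeze(1).expand(N, N, -1)
        oj = objs.unsqueeze(0).expand(N, N, -1)
        qNN = q.expand(N, N, -1)
        relations = self.g(torch.cat([oi, oj, qNN], dim=-1)).sum(dim=(0, 1))
        return torch.cat([attended, relations])  # fused feature for VQA / retrieval heads

# Usage: 36 detected objects with 2048-d features and a 512-d question vector.
fusion = RelationalAttentionFusion(obj_dim=2048, q_dim=512, hid_dim=256)
print(fusion(torch.randn(36, 2048), torch.randn(512)).shape)   # torch.Size([2304])
```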
Keywords/Search Tags:Automatic image annotation, Cross-media retrieval, Visual question answering, Deep learning, Visual understanding