
Research On Cross-modal Applications Via Exploiting High-level Semantics

Posted on: 2021-08-31    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T T Qiao    Full Text: PDF
GTID: 1488306221992369    Subject: Computer Science and Technology
Abstract/Summary:
With the development of digital, web, and visual communication technologies, multimedia data is growing rapidly. These rich data resources pose many new challenges for the mining, understanding, and analysis of cross-modal information and related research. For example, information retrieval has moved from single media to multimedia. In addition, the diversity of data and the varied needs of users have spawned many emerging cross-modal tasks, such as visual question answering, where the model must predict an answer given an image and a question, and text-to-image generation, where the model must produce a high-quality image from a textual description. Both the traditional cross-modal retrieval task and the emerging cross-modal interaction and generation tasks are, in essence, the mapping and transformation of semantics across modal data. A deep understanding of data from different modalities is a prerequisite for many cross-modal applications. However, data of different modalities are represented inconsistently and live in different feature spaces (the so-called heterogeneity gap), which makes the effective matching, understanding, and transformation of information across modalities challenging. Therefore, exploiting high-level semantic information from data plays an important role in many cross-modal applications.

In response to these challenges, this thesis addresses the matching, comprehension, and transformation of information from different modalities from the perspective of exploiting high-level semantics. Specifically, it starts from three typical cross-modal applications: cross-modal retrieval (image↔text), visual question answering (image + text→text), and text-to-image generation (text→image), and investigates how to effectively achieve matching, interaction, and generation in cross-modal applications by exploiting high-level semantic information from the data of multiple modalities. The main contributions are summarized as follows.

1. An attentive semantic disentanglement model via adversarial learning is proposed for cross-modal retrieval. It achieves accurate matching of the semantic information of cross-modal data by separating the semantic features of text and images from redundant information such as context and modality, and then using only these "semantic features" to compute the similarity score (an illustrative sketch of this idea is given after the contribution list). Experimental results show that the model filters out redundant contextual and modal information, learns high-quality semantic feature representations of text and images, and greatly improves the accuracy of cross-modal retrieval. In addition, the model achieves state-of-the-art results on four public benchmark datasets, illustrating its generalizability and superiority.

2. A novel model based on explicit attention supervision is proposed for visual question answering. By adding explicit attention supervision to the traditional attention model, the visual question answering model learns more accurate attention weights and gains a better understanding of the semantics of cross-modal interactions, ultimately improving its predictive performance. Experimental results on two benchmark datasets demonstrate the feasibility and superiority of explicit attention supervision.
3. A novel text-to-image generation model via semantic consistency modeling is proposed. By designing and applying a "text→image→text" redescription framework, it ensures that the generated image can be re-described as the input text, providing explicit supervision of semantic coherence across modalities (a sketch of this redescription loop also follows the list). Experimental results demonstrate that the model achieves an effective transformation of the high-level semantics of data across modalities, illustrating its feasibility and effectiveness in ensuring semantic consistency between generated images and input text. In addition, the method outperforms existing generation methods and establishes a new baseline.

4. A novel text-to-image generation model based on multi-layered semantic information fusion is proposed. It preserves the integrity of visual information during generation by first encoding the input text into multiple visually-grounded text features and then generating images by fusing these features. The experimental results show that using multiple visually-grounded text features not only ensures semantic consistency and visual fidelity but also improves the quality of the generated images. Results on two benchmark datasets indicate that the model outperforms representative baseline methods.
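The disentanglement-and-matching idea behind contribution 1 can be illustrated with a minimal PyTorch-style sketch. All module and variable names here are hypothetical and only convey the general scheme described above: each modality is split into a "semantic" part and a "residual" (context/modality) part, a modality discriminator is trained adversarially so that the semantic space becomes modality-invariant, and retrieval similarity is computed on the semantic features only. This is a sketch under those assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (hypothetical names): semantic/residual disentanglement with an
# adversarial modality discriminator; retrieval uses only the semantic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Splits a modality-specific input feature into a 'semantic' part and a
    'residual' part (context / modality information)."""
    def __init__(self, in_dim, sem_dim, res_dim):
        super().__init__()
        self.semantic = nn.Sequential(nn.Linear(in_dim, sem_dim), nn.Tanh())
        self.residual = nn.Sequential(nn.Linear(in_dim, res_dim), nn.Tanh())

    def forward(self, x):
        return self.semantic(x), self.residual(x)

class ModalityDiscriminator(nn.Module):
    """Tries to tell image vs. text from the semantic features; the encoders are
    trained to fool it, pushing the semantic space to be modality-invariant."""
    def __init__(self, sem_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim, sem_dim // 2), nn.ReLU(),
                                 nn.Linear(sem_dim // 2, 1))

    def forward(self, s):
        return self.net(s)  # logit: image (1) vs. text (0)

def retrieval_scores(img_sem, txt_sem):
    """Similarity is computed on the semantic features only (cosine here)."""
    img_sem = F.normalize(img_sem, dim=-1)
    txt_sem = F.normalize(txt_sem, dim=-1)
    return img_sem @ txt_sem.t()  # [n_images, n_texts] score matrix

# Usage sketch: encode pre-extracted image / text features, score all pairs.
img_enc = DisentangledEncoder(in_dim=2048, sem_dim=512, res_dim=256)
txt_enc = DisentangledEncoder(in_dim=1024, sem_dim=512, res_dim=256)
img_sem, _ = img_enc(torch.randn(8, 2048))
txt_sem, _ = txt_enc(torch.randn(8, 1024))
scores = retrieval_scores(img_sem, txt_sem)  # higher = better match

# Adversarial term: the discriminator guesses the modality of each semantic
# feature; the encoders are then updated to make that guess hard.
disc = ModalityDiscriminator(sem_dim=512)
logits = disc(torch.cat([img_sem, txt_sem], dim=0))
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)], dim=0)
adv_loss = F.binary_cross_entropy_with_logits(logits, labels)
```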
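Similarly, the "text→image→text" redescription framework of contribution 3 can be sketched as a consistency loop: a generator produces an image from the text, a captioner re-describes that image, and a consistency loss pulls the redescription back toward the input sentence. The modules below are deliberately toy-sized and hypothetical; they show only the shape of this supervision, not the thesis's actual architecture or losses.

```python
# Minimal sketch (hypothetical modules): "text -> image -> text" redescription loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens):               # tokens: [B, T]
        _, h = self.rnn(self.emb(tokens))
        return h.squeeze(0)                  # [B, dim] sentence embedding

class Generator(nn.Module):
    """Maps a sentence embedding (plus noise) to a small image."""
    def __init__(self, dim=256, img_size=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 3 * img_size * img_size), nn.Tanh())
        self.img_size = img_size

    def forward(self, sent, noise):
        x = self.net(torch.cat([sent, noise], dim=-1))
        return x.view(-1, 3, self.img_size, self.img_size)

class Captioner(nn.Module):
    """Re-describes the generated image; reduced here to predicting token logits."""
    def __init__(self, vocab, max_len=16, img_size=64):
        super().__init__()
        self.head = nn.Linear(3 * img_size * img_size, max_len * vocab)
        self.max_len, self.vocab = max_len, vocab

    def forward(self, img):
        logits = self.head(img.flatten(1))
        return logits.view(-1, self.max_len, self.vocab)  # [B, T, vocab]

def redescription_loss(logits, tokens):
    """Token-level cross-entropy between the redescription and the input text,
    which supervises semantic consistency between generated image and input."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

# Usage sketch: the consistency term would be added to the usual adversarial losses.
vocab, B, T = 1000, 4, 16
tokens = torch.randint(0, vocab, (B, T))
enc, gen, cap = TextEncoder(vocab), Generator(), Captioner(vocab, max_len=T)
fake = gen(enc(tokens), torch.randn(B, 256))
loss_consistency = redescription_loss(cap(fake), tokens)
```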
Keywords/Search Tags:Cross-modal applications, Feature representation, Attention mechanisms, Deep learning, Adversarial learning