
Research On Multimodal Named Entity Recognition Model Based On Deep Learning

Posted on: 2024-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Song
Full Text: PDF
GTID: 2568306944953739
Subject: Software engineering

Abstract/Summary:
Knowledge graphs, an important branch of knowledge engineering, describe concepts in the physical world and the relationships among them in a structured, symbolic form. Named entity recognition (NER) aims to extract entity-oriented knowledge from text and integrate it into knowledge graphs. In recent years, with the development of artificial intelligence and the arrival of the big-data era, massive amounts of multimodal data have been generated; processing and mining these data can help people better understand textual content and extract valuable information. Multimodal NER combines information from multiple modalities to recognize entities more accurately and to provide richer entity types for knowledge graph construction. With the rapid advancement of deep learning, neural multimodal NER models that use images to help recognize named entities in social media text have become a research hotspot. Although these models have brought certain improvements, two main issues remain: (1) semantic features within individual modalities are not explored effectively, so part of the information is neglected and the semantic information of the different modalities is not fully exploited during fusion, which harms entity recognition; (2) when the visual objects detected in an image are inconsistent with the number or types of entities in the text, biases introduced by the visual objects can mislead entity recognition.

To address the ineffective exploration of semantic features and the insufficient multimodal interaction, this thesis proposes a multimodal interactive NER model based on semantic enhancement of images and text (MIITSE). A representation dictionary built from a social media corpus enhances the extraction of textual features with external knowledge. A hybrid architecture combining convolutional neural networks and vision Transformers considers both global and local information during image feature extraction. A multimodal interaction module with cross-modal attention mechanisms extracts entity-relevant features from both images and text and fuses the multimodal information more effectively. Finally, an attention-based multimodal representation is used to label the entity types in the text.

To address entity recognition being misled by visual objects when the objects in an image are inconsistent with the number or types of entities in the text, this thesis proposes a multimodal fusion NER model based on debiased contrastive learning (MFDCL). It incorporates a multimodal fusion module with cross-modal gating mechanisms to capture the various semantic relationships among multimodal semantic units. During contrastive learning, a hard-sample mining strategy and a debiased contrastive loss are employed to alleviate the biases caused by inconsistencies between images and text in the number and types of entities. Finally, the learned semantic space is combined with a Global Pointer decoder to identify the entities in the text.

Lastly, experiments on the Twitter-2015 and Twitter-2017 social media datasets compare the proposed MIITSE and MFDCL models with baseline models and demonstrate their feasibility. The results indicate that improving the quality of image-text feature extraction and mitigating the biases caused by visual objects both have a positive impact on the accuracy of named entity recognition.
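The cross-modal attention used for image-text fusion can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function name, the toy dimensions (4 tokens, 3 image regions, 8-dim features), and the use of NumPy in place of a neural framework are all illustrative assumptions; only the mechanism (text tokens as queries attending over image-region keys/values via scaled dot-product softmax) follows the description above.

```python
import numpy as np

def cross_modal_attention(text_feats, image_feats):
    """Text tokens (queries) attend over image regions (keys/values).

    text_feats:  (T, d) token embeddings
    image_feats: (R, d) image-region embeddings
    Returns a (T, d) matrix of image-aware token representations,
    each row a convex combination of region features.
    """
    d = text_feats.shape[1]
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (T, R) similarities
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over regions
    return weights @ image_feats                       # (T, d)

# toy example: 4 text tokens, 3 image regions, 8-dim features
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
image = rng.standard_normal((3, 8))
fused = cross_modal_attention(text, image)             # shape (4, 8)
```

In a full model this would run in both directions (text-to-image and image-to-text) with learned query/key/value projections, and the fused representation would feed the entity-type labeling layer.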
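The debiased contrastive objective with hard-sample mining described above can be sketched for a single anchor as follows. This is a simplified illustration, not the thesis's loss: the function name, the hyperparameter values, and the specific debiasing estimator (subtracting an expected false-negative term governed by a prior `tau_plus`, with hardness weights that up-weight negatives similar to the anchor) are assumptions in the general spirit of debiased/hard-negative contrastive learning.

```python
import numpy as np

def debiased_contrastive_loss(sim_pos, sim_negs, tau=0.5, tau_plus=0.1):
    """Debiased InfoNCE-style loss for one anchor (illustrative sketch).

    sim_pos:  similarity of the anchor to its positive (scalar)
    sim_negs: similarities of the anchor to N negatives (1-D array)
    tau:      temperature
    tau_plus: assumed prior probability that a sampled "negative"
              is actually a positive (the source of the bias)
    """
    n = len(sim_negs)
    pos = np.exp(sim_pos / tau)
    negs = np.exp(sim_negs / tau)
    # hard-sample weighting: negatives more similar to the anchor count more
    w = negs / negs.mean()
    # debiased negative mass: subtract the expected false-negative
    # contribution, clipped at its theoretical minimum n * e^(-1/tau)
    neg_mass = np.maximum(((w * negs).sum() - tau_plus * n * pos)
                          / (1.0 - tau_plus),
                          n * np.exp(-1.0 / tau))
    return -np.log(pos / (pos + neg_mass))

# harder (more similar) negatives should yield a larger loss
loss_hard = debiased_contrastive_loss(0.9, np.array([0.8, 0.7, 0.6]))
loss_easy = debiased_contrastive_loss(0.9, np.array([0.1, 0.0, -0.1]))
```

The clipping keeps the estimated negative mass positive even when the correction term dominates, which keeps the loss well defined.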
Keywords/Search Tags:Multimodal Named Entity Recognition, Deep Learning, Feature Enhancement, Attention Mechanism, Contrastive Learning