
Research On Multimodal Model Based On External Attention Mechanism

Posted on: 2024-08-14    Degree: Master    Type: Thesis
Country: China    Candidate: Y D Zheng    Full Text: PDF
GTID: 2568306917473994    Subject: Software engineering
Abstract/Summary:
Today, with the prevalence of deep learning, a single modality of data is no longer sufficient for difficult tasks. To push artificial intelligence toward a deeper understanding of the world, researchers combine multimodal information for interpretation and reasoning. Multimodal information refers to information from different modalities, such as images and text; analyzing and reasoning over several modalities jointly better simulates the way humans perceive and understand. This approach has been applied successfully in speech recognition, image segmentation, natural language processing, and other fields.

This paper proposes CMEEA, a cross-modal encoder representation method based on the external attention mechanism, which performs well on visual question answering, commonsense question answering, and reasoning tasks. The two external memory units of external attention in the External Attention Encoder can be viewed as dictionaries over the entire dataset: they improve network performance, learn more representative features of the input, and reduce computational cost. External attention has linear complexity and implicitly models correlations between all data samples. Five pre-training tasks are employed to help the model learn intra-modal and cross-modal relationships. The paper also demonstrates generalization by applying the pre-trained cross-modal model to a challenging visual reasoning task, improving on the previous best result by 0.1%, and improves visual question answering (VQA) by 1.3%.

This paper also studies multimodal image retrieval and proposes EMPC, a probabilistic combinatorial embedding model based on external attention. Modal encoders with external attention first learn probabilistic embeddings of images and text, and composite embeddings are then formed from multiple modality combinations. EMPC aligns text embeddings with target image embeddings by minimizing a probabilistic contrastive loss. On three multimodal-combination image retrieval queries, EMPC improves R@5, R@10, and R_P by 16.84%, 18.84%, and 4.69%, respectively.

Finally, the paper addresses weakly supervised phrase grounding on the Flickr30k dataset and proposes EMAF, a multimodal alignment framework based on external attention. Information from different modalities is aligned by computing cross-modal similarity through an external attention mechanism. EMAF uses ResNet-101 as the backbone network and Faster R-CNN as the object detector, achieving strong results: evaluation on Flickr30k shows an accuracy improvement of 0.8%.
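To make the external attention operation underlying CMEEA concrete, the following is a minimal PyTorch sketch, assuming the standard two-memory-unit formulation with double normalization; the class name, dimensions, and memory size are illustrative, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External attention: two learnable external memory units (M_k, M_v)
    act as shared dictionaries over the whole dataset, giving linear
    complexity in the sequence length."""
    def __init__(self, d_model: int, memory_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, memory_size, bias=False)  # memory unit M_k
        self.mv = nn.Linear(memory_size, d_model, bias=False)  # memory unit M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        attn = self.mk(x)                      # (batch, seq_len, memory_size)
        attn = torch.softmax(attn, dim=1)      # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                   # (batch, seq_len, d_model)

x = torch.randn(2, 16, 256)
print(ExternalAttention(256)(x).shape)  # torch.Size([2, 16, 256])
```

Because the memory units are shared across all samples rather than computed per input, the cost grows linearly with sequence length, which is the property the abstract highlights.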
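How EMPC's probabilistic embeddings and contrastive objective might look is sketched below, assuming Gaussian embeddings with the reparameterization trick, additive composition, and an InfoNCE-style loss; `ProbabilisticHead`, `contrastive_loss`, and all dimensions are hypothetical names for illustration, not the thesis code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Maps a modality feature to a sample from a Gaussian embedding."""
    def __init__(self, d_in: int, d_emb: int):
        super().__init__()
        self.mu = nn.Linear(d_in, d_emb)
        self.logvar = nn.Linear(d_in, d_emb)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample an embedding from N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return F.normalize(z, dim=-1)

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE-style loss pulling each composed query toward its target."""
    logits = query @ target.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(query.size(0))       # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)

img_head, txt_head = ProbabilisticHead(512, 256), ProbabilisticHead(512, 256)
img, txt = torch.randn(8, 512), torch.randn(8, 512)
query = F.normalize(img_head(img) + txt_head(txt), dim=-1)  # composite embedding
print(contrastive_loss(query, img_head(torch.randn(8, 512))).item())
```

In a real retrieval setup the targets would be the embeddings of the ground-truth images for each (image, text) query pair; random tensors are used here only to show the shapes.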
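For the alignment step in EMAF, the following is a hedged sketch of cross-modal similarity matching for weakly supervised phrase grounding. It uses plain cosine similarity in place of the external-attention-based similarity described above, and in practice the region features would come from the Faster R-CNN detector with the ResNet-101 backbone; the function name and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def ground_phrases(phrase_emb: torch.Tensor, region_emb: torch.Tensor):
    """Align each phrase to the image region with the highest cosine similarity.
    phrase_emb: (num_phrases, d), region_emb: (num_regions, d)."""
    sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(region_emb, dim=-1).t()
    scores, best_region = sim.max(dim=1)  # best-matching region per phrase
    return best_region, scores

phrases = torch.randn(3, 256)   # e.g., phrase embeddings from a text encoder
regions = torch.randn(10, 256)  # e.g., region features from the detector
idx, score = ground_phrases(phrases, regions)
print(idx, score)
```

Under weak supervision, only image-sentence pairs are available at training time, so a similarity matrix of this kind is what lets phrase-region correspondences emerge without box-level labels.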
Keywords/Search Tags:Deep learning, pre-training models, multimodal models, multimodal alignment, external attention mechanisms