
Research On Multimodal Model Based On External Attention Mechanism

Posted on: 2024-08-14    Degree: Master    Type: Thesis
Country: China    Candidate: Y D Zheng    Full Text: PDF
GTID: 2568306917473994    Subject: Software engineering
Abstract/Summary:
Today, with the prevalence of deep learning, a single modality of data is no longer sufficient for difficult tasks. To push artificial intelligence toward a deeper understanding of the world, researchers combine multimodal information for interpretation and reasoning. Multimodal information refers to information from different modalities, such as images and text; analyzing and reasoning over several modalities jointly better simulates the way humans perceive and understand. This approach has been applied successfully in speech recognition, image segmentation, natural language processing, and other fields.

This paper proposes CMEEA, a cross-modal encoder representation method based on the external attention mechanism, which performs well on visual question answering, commonsense question answering, and reasoning tasks. The two external memory units of external attention in the External Attention Encoder can be viewed as dictionaries over the entire dataset: they improve network performance, learn more representative features of the input, and reduce computational cost. External attention has linear complexity and implicitly models correlations between all data samples. Five pre-training tasks are employed to help the model learn intra-modal and cross-modal relationships. The paper also demonstrates generalization by applying the pre-trained cross-modal model to a challenging visual reasoning task, improving on the previous best result by 0.1%, and improves visual question answering (VQA) by 1.3%.

This paper also studies multimodal image retrieval and proposes EMPC, a probabilistic combinatorial embedding model based on external attention. Modal encoders with external attention first learn probabilistic embeddings of images and text, and composite embeddings are then formed from multiple modality combinations. EMPC aligns text embeddings with target image embeddings by minimizing a probabilistic contrastive loss. On three multimodal-combination image retrieval queries, EMPC improves R@5, R@10, and R_P by 16.84%, 18.84%, and 4.69%, respectively.

Finally, the paper addresses weakly supervised phrase grounding on the Flickr30k dataset and proposes EMAF, a multimodal alignment framework based on external attention. Information from different modalities is aligned by computing cross-modal similarity through an external attention mechanism. EMAF uses ResNet-101 as the backbone network and Faster R-CNN as the object detector, achieving strong results: evaluation on Flickr30k shows an accuracy improvement of 0.8%.
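To make the external attention operation underlying CMEEA concrete, the following is a minimal PyTorch sketch, assuming the standard two-memory-unit formulation with double normalization; the class name, dimensions, and memory size are illustrative, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External attention: two learnable external memory units (M_k, M_v)
    act as shared dictionaries over the whole dataset, giving linear
    complexity in the sequence length."""
    def __init__(self, d_model: int, memory_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, memory_size, bias=False)  # memory unit M_k
        self.mv = nn.Linear(memory_size, d_model, bias=False)  # memory unit M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        attn = self.mk(x)                      # (batch, seq_len, memory_size)
        attn = torch.softmax(attn, dim=1)      # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                   # (batch, seq_len, d_model)

x = torch.randn(2, 16, 256)
print(ExternalAttention(256)(x).shape)  # torch.Size([2, 16, 256])
```

Because the memory units are shared across all samples rather than computed per input, the cost grows linearly with sequence length, which is the property the abstract highlights.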
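How EMPC's probabilistic embeddings and contrastive objective might look is sketched below, assuming Gaussian embeddings with the reparameterization trick, additive composition, and an InfoNCE-style loss; `ProbabilisticHead`, `contrastive_loss`, and all dimensions are hypothetical names for illustration, not the thesis code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Maps a modality feature to a sample from a Gaussian embedding."""
    def __init__(self, d_in: int, d_emb: int):
        super().__init__()
        self.mu = nn.Linear(d_in, d_emb)
        self.logvar = nn.Linear(d_in, d_emb)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample an embedding from N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return F.normalize(z, dim=-1)

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE-style loss pulling each composed query toward its target."""
    logits = query @ target.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(query.size(0))       # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)

img_head, txt_head = ProbabilisticHead(512, 256), ProbabilisticHead(512, 256)
img, txt = torch.randn(8, 512), torch.randn(8, 512)
query = F.normalize(img_head(img) + txt_head(txt), dim=-1)  # composite embedding
print(contrastive_loss(query, img_head(torch.randn(8, 512))).item())
```

In a real retrieval setup the targets would be the embeddings of the ground-truth images for each (image, text) query pair; random tensors are used here only to show the shapes.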
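For the alignment step in EMAF, the following is a hedged sketch of cross-modal similarity matching for weakly supervised phrase grounding. It uses plain cosine similarity in place of the external-attention-based similarity described above, and in practice the region features would come from the Faster R-CNN detector with the ResNet-101 backbone; the function name and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def ground_phrases(phrase_emb: torch.Tensor, region_emb: torch.Tensor):
    """Align each phrase to the image region with the highest cosine similarity.
    phrase_emb: (num_phrases, d), region_emb: (num_regions, d)."""
    sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(region_emb, dim=-1).t()
    scores, best_region = sim.max(dim=1)  # best-matching region per phrase
    return best_region, scores

phrases = torch.randn(3, 256)   # e.g., phrase embeddings from a text encoder
regions = torch.randn(10, 256)  # e.g., region features from the detector
idx, score = ground_phrases(phrases, regions)
print(idx, score)
```

Under weak supervision, only image-sentence pairs are available at training time, so a similarity matrix of this kind is what lets phrase-region correspondences emerge without box-level labels.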
Keywords/Search Tags:Deep learning, pre-training models, multimodal models, multimodal alignment, external attention mechanisms