
Research On Multimodal Deep Learning Algorithm Based On Attention Mechanism

Posted on: 2024-05-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F Dong    Full Text: PDF
GTID: 1528307319982069    Subject: Information management and information systems

Abstract/Summary:
Technologies for integrating information from different modalities, such as vision and language, have developed rapidly in recent years, driven by the emergence of big data processing, high-performance computing tools, and deep learning. Deep learning is the most essential of these. As a newer research direction within machine learning, deep learning uncovers the regularities and representation hierarchies inherent in the sample data it is trained on, giving machines a degree of human-like analytical and learning capability. In computer vision, natural language processing, and data mining, deep learning has greatly advanced the state of the art, yet a large gap remains between these systems and the complex real-world environments that humans handle. Humans, with their advanced intelligence, routinely face complex environments that carry information in multiple modalities, and how to use existing intelligent technologies to process such multimodal information has become an urgent problem. Multimodal information processing and management tasks that combine vision and language, such as image captioning and visual question answering, have therefore attracted growing research attention. Image captioning summarizes an image as text, that is, the computer automatically generates a textual description of a given image; it is widely applicable to image retrieval, image-aided recognition, and related tasks. Visual question answering requires answering specific questions on the basis of understanding an image, and has significant application value in navigation for the blind, human-computer interaction, and automated security. Compared with image captioning, visual question answering demands more refined processing and understanding of both visual and linguistic information.

Under a deep learning framework, a multimodal visual question answering task that combines vision and language must address three issues: (1) fusing the information of the two modalities; (2) producing high-level semantic understanding from the fused visual and linguistic information; and (3) jointly reasoning toward the task goal. The attention mechanism plays an irreplaceable role in all three. It can fuse visual and linguistic features, locate and extract effective features within and between the two modalities to obtain high-level semantic understanding, and then combine that understanding with the task objective for joint reasoning to produce the result the task requires. However, the attention mechanisms in existing visual question answering models still have several problems.

First, current visual question answering models assign the visual and linguistic features the same contribution before the fusion and inference stage, feeding both directly into the deep network without considering the relative weights between modalities, which is clearly unreasonable. When the human brain processes a multimodal task, the modalities are not weighted equally: visual images may be weighted more heavily for some people and linguistic text for others; that is, some people attend more to image features while others attend more to linguistic features. In short, human attention acts selectively not only within but also between modalities. Feeding each modality into the deep network with equal weight therefore gives every modality the same contribution to the task outcome, which violates how human attention operates. Inspired by the idea of normalization, this dissertation designs a new attention mechanism that uses a normalization function to redistribute the weights of the two modalities, visual image and linguistic text, both within and between modalities in the multimodal visual question answering task. This makes the mechanism more consistent with the intelligent characteristics of human attention, improves it from a bionic perspective, and helps strengthen the processing capability of multimodal information management tasks.
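To make the idea concrete, the following is a minimal PyTorch sketch of how a normalization-based redistribution of inter-modality weights might look; the module, its parameter names, and the pooling choices are illustrative assumptions, not the dissertation's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterModalityReweighting(nn.Module):
    """Hypothetical sketch: redistribute the contribution of the visual and
    textual modalities with a normalization (softmax) over per-modality scores,
    instead of feeding both into the fusion network with equal weight."""

    def __init__(self, vis_dim: int, txt_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        # One scalar score per modality, computed from its pooled features.
        self.vis_score = nn.Linear(hidden_dim, 1)
        self.txt_score = nn.Linear(hidden_dim, 1)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, n_regions, vis_dim), txt_feats: (batch, n_tokens, txt_dim)
        v = self.vis_proj(vis_feats).mean(dim=1)   # pooled visual representation
        t = self.txt_proj(txt_feats).mean(dim=1)   # pooled textual representation
        scores = torch.cat([self.vis_score(v), self.txt_score(t)], dim=-1)
        weights = F.softmax(scores, dim=-1)        # inter-modality weights sum to 1
        # Scale each modality by its learned contribution before fusion.
        v_weighted = weights[:, 0:1] * v
        t_weighted = weights[:, 1:2] * t
        return torch.cat([v_weighted, t_weighted], dim=-1)
```

Here the softmax over the two modality scores plays the role of the normalization function: the relative contribution of vision and language is learned per example rather than fixed to be equal.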
Second, in the multimodal feature fusion stage, the attention mechanism applies its weighting to all of the visual and question features extracted by the pretrained model, without fully considering whether that information is valid. The high-level semantic information obtained in the fusion stage has a decisive impact on visual question answering performance, so indiscriminately feeding all pretrained multimodal features into the existing attention operation tends to hurt the accuracy of the task. To ignore feature information that is irrelevant to the task outcome, the attention mechanism should be able to screen and filter features in addition to processing them. This dissertation therefore draws on thresholding methods to achieve feature sparsity by screening the multimodal features that take part in the attention operation. Unlike other feature sparsification methods, this approach preserves the modal features through the processing pipeline and retains the important parts of the modal information, improving the attention operation and restoring the filtering function that the attention mechanism should have in the first place.
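One common way to realize this kind of feature screening is to discard low-scoring attention positions before the weights are renormalized; the sketch below follows that assumption (the top-k threshold rule and all names are illustrative, not the dissertation's exact method).

```python
import torch
import torch.nn.functional as F


def sparse_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                     keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative sketch: keep only the top-scoring fraction of attention
    positions so that features irrelevant to the task are filtered out,
    then renormalize the remaining weights."""
    # query: (batch, d), key/value: (batch, n, d)
    scores = torch.einsum("bd,bnd->bn", query, key) / key.size(-1) ** 0.5
    k = max(1, int(keep_ratio * scores.size(-1)))
    # The k-th largest score per example defines the cut-off threshold.
    threshold = scores.topk(k, dim=-1).values[:, -1:]
    masked = scores.masked_fill(scores < threshold, float("-inf"))
    weights = F.softmax(masked, dim=-1)            # sparse, renormalized weights
    return torch.einsum("bn,bnd->bd", weights, value)
```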
Third, existing attention mechanisms often ignore the contextual information of targets in the multimodal features during joint inference, and therefore cannot fully and correctly answer natural language questions that require relational reasoning. The contextual relationships among targets in the modal information cannot be ignored during multimodal fusion and inference, and clarifying these relationships correctly is crucial to answer prediction. Co-attention uses the transformer structure to let the question language guide attention over the visual image and thereby handle contextual relations in visual question answering. However, this kind of co-attention has two non-negligible problems: it only considers how key linguistic features guide attention toward key targets in the visual image, ignoring the reverse guidance from key visual targets to the linguistic text; and its operation does not focus sufficiently on the important feature information of each modality. To address these problems, this dissertation performs a secondary attention operation on top of co-attention, supplemented by mutual guidance between the visual image and the question text, to pinpoint the key targets of each modality while clarifying, from the perspective of logical reasoning, the contextual relationships of targets within and between modalities (see the sketch after this abstract). This provides a useful reference for using attention mechanisms to handle the contextual relations of targets in multimodal information tasks and to answer natural language questions that require inference.

Drawing on the development of multimodal information management and continued research into the human attention mechanism, this dissertation designs a visual question answering information management system within a deep learning framework. Under existing technical conditions, the system consists of three parts: multimodal feature extraction, multimodal information fusion and reasoning, and answer prediction. Implemented on multiple image datasets, the system improves multimodal information management from the perspective of simulating how humans handle multimodal tasks.

On the surface, research on the attention mechanism in multimodal visual question answering is a matter of continuously innovating attention algorithms; in essence, it is a process of continuously improving the "intelligence" in "artificial intelligence". Enabling machines to handle complex multimodal scenes as humans do is the ultimate goal of artificial intelligence. "Intelligence" is an important mark that distinguishes humans from other species, and the level of attention directly affects the development of human intelligence and the absorption of knowledge; likewise, attention plays a pivotal role in developing "intelligence" in machines. This dissertation therefore re-examines the meaning of "intelligence" and, guided by an "intelligent evolution model", studies deep learning algorithms based on the attention mechanism for multimodal visual question answering tasks.
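As referenced above, the following is a minimal sketch of the mutual-guidance co-attention idea, assuming standard transformer-style cross-attention (PyTorch nn.MultiheadAttention); the two directional passes are an illustrative approximation, not the dissertation's exact secondary-attention design.

```python
import torch
import torch.nn as nn


class MutualCoAttention(nn.Module):
    """Hypothetical sketch: in addition to the usual question-guided attention
    over image regions, a second attention pass lets the image guide the
    question text, so that key targets in both modalities are located."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.question_guides_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_guides_question = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (batch, n_regions, dim), txt_feats: (batch, n_tokens, dim)
        # Question-guided pass: each question token gathers the image regions it refers to.
        img_context, _ = self.question_guides_image(query=txt_feats, key=img_feats, value=img_feats)
        # Mutual-guidance pass: each image region gathers the question words that describe it.
        txt_context, _ = self.image_guides_question(query=img_feats, key=txt_feats, value=txt_feats)
        return img_context, txt_context
```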
Keywords/Search Tags:multimodal information management, multimodal information fusion reasoning, visual question answering, attention mechanism, modal information contribution