
Scene Recognition Based On Multi-modal Information And Global Self-attention Mechanism

Posted on: 2023-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2568306836972169
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of modern intelligent systems, it has become increasingly important for such systems to understand their location and surrounding environment. The purpose of scene recognition is to help a computer understand its surroundings: it describes the scene category to which an image belongs, rather than merely listing the objects in the scene. Scene recognition is now widely used in human-computer interaction, intelligent robots, intelligent video surveillance and autonomous driving, and has become one of the important tasks in machine vision.

Current improvements in scene recognition performance stem mainly from the rapid development of convolutional neural networks and the emergence of large-scale datasets. With the spread of low-cost depth sensors, scene recognition based on RGB images and depth images has become a new research direction. Studies have shown that these two modalities are complementary, which helps improve scene recognition performance. In addition, scene images contain multiple objects with a complex spatial distribution, so modeling the image as a whole performs poorly. The attention mechanism can focus on useful information and suppress useless information during learning, which makes it well suited to scene recognition. In recent years, the Transformer model has pushed the attention mechanism to a new level, and how to use the Transformer to advance machine vision is a current research hotspot.

Building on previous research, this thesis investigates a scene recognition method based on multi-modal information and a global self-attention mechanism. The main research contents are:

(1) Literature on scene recognition was surveyed extensively to establish the research status, in China and abroad, of algorithms related to scene recognition and attention mechanisms, and the research background and significance of scene recognition are set out. The current problems and challenges in scene recognition are analyzed in depth, and representative scene datasets are introduced.

(2) The latest RGB-D scene recognition algorithms are analyzed and compared, and classical attention mechanism methods are studied. Experimental analysis shows that combining RGB images with depth images gives better scene recognition performance, and that the attention mechanism further improves scene recognition accuracy.

(3) The development of the Transformer model in machine vision tasks is reviewed, and its basic structure is introduced. To better understand the advantages of the Transformer, its key techniques are studied. The Transformer model is then applied to scene recognition, and its feasibility is demonstrated experimentally.

(4) An end-to-end trainable two-channel deep neural network model is proposed that combines multi-modal information with a global self-attention mechanism for scene recognition. It is called the SR-MGA (Scene Recognition Based on Multi-Modal Information and Global Self-attention Mechanism) model. The model is divided into four network modules: a sequence generation network, a global self-attention coding network, a feature fusion network and a classification network. The two channels share the same structure, each consisting of a sequence generation network followed by a global self-attention coding network, which in turn consists of several global self-attention coding modules. To address network overfitting, Dropout is added to the residual connection of the global self-attention coding module. In addition, a lateral connection is added between the two channels to further exploit the complementarity between the two modalities.

(5) The strategies used in the proposed SR-MGA model are verified, and the performance of the model is analyzed through experiments on the SUN RGB-D and NYUD2 datasets. The results show that the SR-MGA model outperforms other scene recognition methods by a wide margin, which demonstrates its effectiveness.
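The architecture described in item (4) can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract gives no layer sizes, projection weights, or details of the lateral connection and fusion, so everything below is an assumption. Specifically, the attention uses identity query/key/value projections, the lateral connection is taken to be an additive depth-to-RGB link after each module, fusion is taken to be mean pooling followed by concatenation, and the classification network is omitted. All function names (`self_attention`, `coding_module`, `two_channel_forward`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Global self-attention: every token attends to every token.

    Simplification (assumption): Q, K and V are the tokens themselves,
    i.e. identity projections instead of learned weight matrices.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def dropout(tokens, p, rng, training):
    """Inverted dropout; a no-op outside training or when p == 0."""
    if not training or p == 0.0:
        return tokens
    keep = 1.0 - p
    return [[x / keep if rng.random() < keep else 0.0 for x in row] for row in tokens]


def add(a, b):
    """Element-wise sum of two token sequences of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def coding_module(tokens, p, rng, training):
    """One coding module: x + Dropout(SelfAttention(x)).

    Dropout sits on the residual branch, as the abstract describes.
    """
    return add(tokens, dropout(self_attention(tokens), p, rng, training))


def mean_pool(tokens):
    """Average the token sequence into a single feature vector."""
    n = len(tokens)
    return [sum(row[j] for row in tokens) / n for j in range(len(tokens[0]))]


def two_channel_forward(rgb, depth, num_modules=2, p=0.1, training=False, seed=0):
    """Run both channels and fuse them into one feature vector."""
    rng = random.Random(seed)
    for _ in range(num_modules):
        rgb = coding_module(rgb, p, rng, training)
        depth = coding_module(depth, p, rng, training)
        # Assumed lateral connection: depth features are added into the RGB channel.
        rgb = add(rgb, depth)
    # Assumed fusion: pool each channel, then concatenate.
    return mean_pool(rgb) + mean_pool(depth)
```

Calling `two_channel_forward(rgb_tokens, depth_tokens)` with two token sequences of equal shape returns a fused vector twice the token dimension, which a classification layer would then map to scene categories.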
Keywords/Search Tags: scene recognition, multi-modal, global self-attention mechanism, lateral connection