
Research On Multi-modal Human Action Recognition Based On Features Fusion And Attention Mechanisms

Posted on: 2022-03-28  Degree: Master  Type: Thesis
Country: China  Candidate: S Q Wang  Full Text: PDF
GTID: 2518306527477974  Subject: Computer technology
Abstract/Summary:
Human action recognition aims to recognize and understand the actions and intentions of the human body in data. It is an important and popular research topic in the field of computer vision and plays a vital role in robotics, human-computer interaction, and intelligent monitoring. Although early research made great progress, action recognition algorithms are still affected by factors such as illumination changes, scale changes, and fine-grained actions. With the increasing diversity of action recognition data, exploiting the complementary advantages of multi-modal data for joint prediction has gradually become a key research direction. Many existing works integrate only the highest-level features through score fusion, which limits deeper information interaction. In addition, many researchers model long-term dependencies by increasing the depth of the network, which improves performance but is accompanied by high computational cost and over-fitting. In response to these problems, this dissertation extracts multi-level hybrid features to achieve deeper interaction and adopts attention mechanisms to enhance the key information in the features, thereby strengthening the network's ability to model long-term dependencies. The main research work and achievements of this dissertation are as follows:

(1) This dissertation proposes a Multiple Depth-levels Features Fusion Enhanced Network (MDFFEN) to address two problems. First, most existing two-stream action recognition methods integrate only the prediction results of the two streams at the last level, so the complementary characteristics of the two streams are hard to exploit fully. Second, irrelevant noise in the features interferes with model training. To make more efficient use of the complementarity of the two modalities, RGB and optical flow, this dissertation proposes Multiple Depth-levels Features Fusion (MDFF), which embeds the proposed Spatial-Temporal Features Fusion (STFF) module at different levels of the two streams to capture multi-level hybrid features, which are then further fused to mine deeper hybrid features. In addition, this dissertation designs a Group-wise Spatial-Channel Enhance (GSCE) module, which adaptively assigns weights to features in the spatial and channel dimensions and then refines more discriminative weighted enhancement features. Finally, the prediction results of the two original streams and the fusion stream are combined through weighted score fusion to further improve recognition performance.

(2) This dissertation proposes a Two-Stream Ternary Graph Convolutional Enhanced Network (2S-TGCEN) to address the extraction of effective information and feature enhancement in skeleton-based action recognition. First, this dissertation designs a Ternary Adaptive Graph Convolution (TAGC) module, which generalizes the graph convolution operation from the spatial dimension to the temporal and channel dimensions to model contextual relationships. In addition, to enhance the skeleton features, this dissertation designs a Graph-based Ternary Enhance (GTE) module, which unites the proposed Graph-based Spatial Attention (GSA) module, Temporal Attention (TA) module, and Channel Attention (CA) module to further refine the extracted skeleton features. In particular, the proposed GSA module mines discriminative local spatial information by modeling the dependency between each node of the skeleton feature and its neighborhood.

(3) This dissertation proposes a Multi-Stream Ternary Graph Convolutional Fusion Network (MS-TGCFN) to address information interaction and joint prediction in existing multi-modal skeleton-based action recognition. First, to extract richer discriminative information, this dissertation constructs motion data containing inter-frame difference information and parallax data containing inter-view difference information from the joint or bone data. In addition, to enable deeper information interaction between multiple modalities, this dissertation designs a Multi-Stream Features Fusion (MSFF) mechanism, which treats features from different levels of different streams as vertices and feeds them to the proposed Ternary Adaptive Graph Convolution (TAGC) module to mine hybrid features at different levels. Finally, the classification result is obtained by combining the prediction results of the base streams and the fusion stream.

In summary, this dissertation conducts in-depth research on human action recognition based on features fusion and effective attention enhancement, and proposes three action recognition networks: MDFFEN, 2S-TGCEN, and MS-TGCFN. Extensive experiments on multiple public datasets demonstrate the strong performance of the proposed algorithms.
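The weighted score fusion used at the end of each proposed network can be illustrated with a minimal sketch. The stream weights, class counts, and example logits below are illustrative assumptions, not values from the dissertation:

```python
import numpy as np

def softmax(logits):
    """Convert a vector of class logits to probabilities."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_score_fusion(stream_logits, weights):
    """Combine per-stream class scores with fixed fusion weights.

    stream_logits: list of (num_classes,) logit vectors, one per stream
    weights: one scalar per stream (assumed fixed here; in practice they
             could be tuned on a validation set or learned)
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize weights
    scores = np.stack([softmax(l) for l in stream_logits])
    fused = (weights[:, None] * scores).sum(axis=0)   # weighted average
    return int(fused.argmax()), fused

# Hypothetical 3-class logits from an RGB stream, an optical-flow
# stream, and a fusion stream, with the fusion stream weighted higher.
rgb  = np.array([2.0, 0.5, 0.1])
flow = np.array([1.5, 1.0, 0.2])
fuse = np.array([2.2, 0.3, 0.4])
label, probs = weighted_score_fusion([rgb, flow, fuse], [1.0, 1.0, 1.5])
# All three streams favor class 0, so the fused prediction is class 0.
```

Averaging normalized probabilities rather than raw logits keeps streams with different logit scales from dominating the fused score.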
Keywords/Search Tags: Action recognition, Multi-Stream Features Fusion, Group-wise Spatial-Channel Enhance, Ternary Adaptive Graph Convolution
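As a rough illustration of the group-wise channel-weighting idea behind the Group-wise Spatial-Channel Enhance keyword, the sketch below applies squeeze-and-excitation-style gating per channel group. The shapes, group count, and gating function are assumptions for illustration; the dissertation's GSCE module also covers the spatial dimension and is more elaborate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_wise_channel_enhance(feat, num_groups=4):
    """Toy group-wise channel attention (illustrative, not the GSCE module).

    feat: (C, H, W) feature map; C must be divisible by num_groups.
    Each channel is re-weighted by a gate computed relative to the
    mean response of its group, so informative channels are enhanced.
    """
    C, H, W = feat.shape
    assert C % num_groups == 0
    g = feat.reshape(num_groups, C // num_groups, H, W)
    # Squeeze: global average pooling per channel within each group
    desc = g.mean(axis=(2, 3), keepdims=True)            # (G, C/G, 1, 1)
    # Excite: gate each channel against its group's mean descriptor
    gate = sigmoid(desc - desc.mean(axis=1, keepdims=True))
    return (g * gate).reshape(C, H, W)

x = np.random.randn(8, 4, 4)
y = group_wise_channel_enhance(x, num_groups=2)
```

Because the gate is a sigmoid in (0, 1), the output keeps the input's shape while attenuating channels that fall below their group's average response.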