
Spatial-Temporal Multi-Granularity Feature Analysis Algorithm For Visual Behavior Understanding

Posted on: 2021-08-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Zhuang
GTID: 1488306503482434
Subject: Information and Communication Engineering
Abstract/Summary:
Visual behavior understanding refers to describing, recognizing, and understanding individual behavioral intentions, human-to-human interactions within groups, and individual/group-to-scene interactions. With the continuous development of the economy and society, visual behavior understanding has been widely applied in many aspects of life: virtual reality, intelligent surveillance, content-based video commentary, and advanced human-computer interaction all offer a wide range of application scenarios and potential commercial value. This dissertation focuses on the problem of visual behavior understanding. The input is an image or video that reflects the behavior of an individual or group in a specific scene; through feature extraction and modeling analysis, the output is a behavior understanding result that meets the expected requirements. The research results cover a wide range of application scenarios and have broad application value in related industries.

Driven by the demands of visual behavior understanding, this dissertation proposes research methods for multi-granularity feature learning in the spatio-temporal dimensions, combined with a research approach that analyzes the relationships among "behavior", "gaze", "individual", "group", and "scene". It focuses on three research aspects: behavior prediction and analysis in long-term visual behavior understanding, prediction of the common intention of multiple people, and portrait analysis of group behavior at multiple granularities. Methodologically, the proposed idea of spatio-temporal multi-granularity feature learning consists of three steps: 1) define fine-grained features; 2) define the spatio-temporal granularities at which the features are analyzed; 3) fuse the multi-granularity analyses to obtain the final result. The proposed methods are refined, systematic, and versatile. "Refined" means that learning features at different granularities captures detailed features and multiple kinds of semantic information that are difficult to obtain at a single granularity. "Systematic" means that the spatio-temporal multi-granularity feature learning model is not limited to specific feature information: it can be applied to any feature information that fits the research idea, and high-performance results are finally obtained through feature fusion. "Versatile" means that this idea can be used to design a series of novel spatio-temporal multi-granularity feature learning models that address different visual behavior understanding problems.

The main work and innovations of this dissertation are as follows.

(1) Long-term behavior prediction and analysis: We propose a novel focus-driven asynchronous event causal correlation analysis model, which addresses behavior analysis over long time spans and substantially improves the accuracy of behavior prediction. The analysis model builds on a temporal-domain multi-granularity feature learning algorithm and on an analysis of the inherent dependence between what people attend to and their conscious behavior. 1) A one-way synchronous model analyzes the video frame sequence along the temporal dimension; it has a larger temporal receptive field and a coarser feature analysis granularity, and it captures long-range dependencies of behavior features in the temporal dimension. 2) A second model adopts an adaptive, dynamic granularity in the temporal dimension: it uses behavioral intention events as the gate signal of an LSTM network to dynamically divide the sequence information into feature analysis units along the temporal dimension. Experimental results show that the proposed analysis model effectively integrates the synchronous features and the asynchronous behavioral-intention-event features in the temporal dimension, filters noisy data, and achieves high accuracy in behavior prediction.

(2) Common intention prediction of multiple people: Our second contribution studies the prediction of the common intention of multiple people in visual behavior understanding. It defines a new concept, group focus, and sets the specific research task as predicting the group focus in group scenes. Based on an analysis of the relationship between individual gaze and group gaze in the scene, we propose a novel spatial-domain multi-granularity feature learning framework, a multi-granularity group gaze learning and prediction framework called MUGGLE (MUlti-granularity Group Gaze Learning and Estimation), which provides technical support for actively predicting and reasoning about group behavioral intentions. MUGGLE has two inference paths: 1) the first path analyzes individual features at the global granularity: individual gaze features are merged into a gaze flow map, which is fed into a deep convolutional network to analyze the global geometric distribution of the individuals in the scene and the related scene context; 2) the second path aggregates features at the individual granularity: it repurposes a sequential LSTM for spatial analysis and robustly aggregates the individual features through its recurrent structure. The two paths are seamlessly integrated into a single fused network. We also built a fully annotated database of more than 8,000 images covering a wide range of group scenes, such as supermarkets, classrooms, and public advertisements. Experimental results show that the proposed spatial-domain multi-granularity feature learning model effectively integrates the gaze features of individual observers in the spatial dimension, filters noisy outliers, and predicts group focus with high accuracy.

(3) Portrait analysis of group behavior at multiple granularities: Our third contribution proposes a multi-granularity group behavior interaction feature learning model that analyzes group behavioral intentions at multiple granularities in the spatio-temporal domain. Compared with existing methods, it understands group intentions in a scene more accurately and comprehensively. To address the problem of outputting multi-granularity portraits of group behavior events, we take the full-scale understanding and description of group behavior in team sports videos as the concrete task. Our research idea is to comprehensively analyze the interactions among "individual behavior", "group behavior", and "scene" in the temporal and spatial dimensions. We propose a new framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), which includes: 1) a spatio-temporal multi-granularity interaction encoding model, which progressively extracts the interaction features among the athletes in the scene and encodes these features within and between teams; and 2) a spatio-temporal multi-granularity attention model, which analyzes the descriptive semantics corresponding to the encoded behavior interaction features at multiple spatio-temporal scales. We also created a new video dataset for group visual behavior understanding, containing 6,000 team sports videos and 10,000 annotated description sentences, and used it to evaluate the proposed multi-granularity interaction feature learning model. Extensive experiments show that the proposed spatio-temporal multi-granularity analysis of group behavior interaction features accurately generates description sentences for intensely competitive team sports automatically.

In summary, this dissertation tackles different research problems in visual behavior understanding and designs spatio-temporal multi-granularity feature learning models suited to each task: a temporal-domain model for long-term behavior prediction and analysis, a spatial-domain model for predicting the common intention of multiple people, and a spatio-temporal model for portrait analysis of group behavior at multiple granularities. Extensive experimental results and theoretical analysis show that the proposed multi-granularity feature learning methods have advantages across these visual behavior understanding tasks.
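The three-step idea of multi-granularity feature learning (define fine-grained features, choose analysis granularities, fuse the results) can be illustrated with a toy one-dimensional example. This is a minimal sketch, not the dissertation's model: the per-frame scalar scores, the window sizes, and the choice to summarize each granularity by its maximum pooled response are all illustrative assumptions, and `pool` and `multi_granularity_features` are hypothetical names.

```python
from typing import List, Sequence


def pool(sequence: Sequence[float], window: int) -> List[float]:
    """Average non-overlapping windows; a shorter trailing window is averaged as-is."""
    if window < 1:
        raise ValueError("window must be >= 1")
    chunks = [sequence[i:i + window] for i in range(0, len(sequence), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def multi_granularity_features(
    sequence: Sequence[float], windows: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """Toy three-step pipeline (hypothetical, for illustration only)."""
    # Step 1: `sequence` already holds the fine-grained, per-frame features.
    if not sequence:
        raise ValueError("sequence must not be empty")
    fused: List[float] = []
    for window in windows:
        # Step 2: analyze the same features at a coarser temporal granularity.
        pooled = pool(sequence, window)
        # Step 3: summarize each granularity by its strongest response and
        # concatenate the summaries into one fused feature vector.
        fused.append(max(pooled))
    return fused


spike = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sustained = [2.0] * 8
print(multi_granularity_features(spike))      # [8.0, 4.0, 2.0]
print(multi_granularity_features(sustained))  # [2.0, 2.0, 2.0]
```

A single spike and a sustained response look identical at the coarsest granularity (window 4 gives a maximum of 2.0 for both), yet the fused vector separates them, which is the kind of detail the "refined" property refers to.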
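For the adaptive-granularity model in contribution (1), the core idea is that behavioral intention events decide where one temporal analysis unit ends and the next begins. The sketch below is a hypothetical simplification rather than the dissertation's network: a leaky running average stands in for the LSTM cell, events are supplied as one boolean per frame instead of being predicted, and `event_gated_segments` and `decay` are illustrative names.

```python
from typing import List, Sequence

Vector = List[float]


def event_gated_segments(
    frames: Sequence[Vector], events: Sequence[bool], decay: float = 0.5
) -> List[Vector]:
    """Summarize frames into segments whose boundaries are set by events.

    Hypothetical stand-ins: a leaky running average replaces the LSTM cell,
    and events[t] == True means a behavioral intention event at frame t
    closes the current segment (the "gate" of this sketch).
    """
    if len(frames) != len(events):
        raise ValueError("frames and events must have the same length")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    summaries: List[Vector] = []
    state: Vector = []
    for frame, is_event in zip(frames, events):
        if not state:
            state = list(frame)  # first frame of a new segment
        else:
            # Leaky update: blend the running state with the new frame.
            state = [decay * s + (1.0 - decay) * x for s, x in zip(state, frame)]
        if is_event:
            summaries.append(state)  # the event closes the segment...
            state = []  # ...and resets the state for the next one
    if state:
        summaries.append(state)  # flush a trailing segment with no closing event
    return summaries


frames = [[0.0], [2.0], [4.0], [10.0]]
events = [False, True, False, False]
print(event_gated_segments(frames, events))  # [[1.0], [7.0]]
```

Here frames 0 and 1 form one segment and frames 2 and 3 another, so the sequence is summarized at a granularity set by the events rather than by a fixed window; in the dissertation's model the gate drives an LSTM instead of a running average.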
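The two inference paths of MUGGLE in contribution (2) can be mimicked on a toy grid. This is an assumption-laden illustration, not MUGGLE itself: the global path rasterizes each person's gaze ray into a count map and takes its peak cell in place of a deep convolutional network; the individual path folds in per-person focus guesses (a point `reach` units along each gaze) with an incremental mean in place of the LSTM; and a fixed weight `alpha` stands in for the fused network. All function names, parameters, and the input format are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
Person = Tuple[Point, Point]  # (position, gaze direction); hypothetical input format
Cell = Tuple[int, int]


def _unit(d: Point) -> Point:
    norm = math.hypot(d[0], d[1])
    if norm == 0.0:
        raise ValueError("gaze direction must be non-zero")
    return (d[0] / norm, d[1] / norm)


def gaze_flow_map(people: Sequence[Person], width: int, height: int) -> Dict[Cell, int]:
    """Global path: count how many gaze rays pass through each grid cell."""
    counts: Dict[Cell, int] = {}
    for (x, y), direction in people:
        ux, uy = _unit(direction)
        visited = set()
        step = 1
        while True:
            cell = (math.floor(x + ux * step), math.floor(y + uy * step))
            if not (0 <= cell[0] < width and 0 <= cell[1] < height):
                break  # the ray has left the scene
            visited.add(cell)
            step += 1
        for cell in visited:  # each ray votes at most once per cell
            counts[cell] = counts.get(cell, 0) + 1
    return counts


def sequential_aggregate(people: Sequence[Person], reach: float) -> Point:
    """Individual path: fold per-person focus guesses in one at a time."""
    mean_x = mean_y = 0.0
    for k, ((x, y), direction) in enumerate(people, start=1):
        ux, uy = _unit(direction)
        guess_x, guess_y = x + reach * ux, y + reach * uy
        # Incremental mean; in MUGGLE a recurrent (LSTM) update plays this role.
        mean_x += (guess_x - mean_x) / k
        mean_y += (guess_y - mean_y) / k
    return (mean_x, mean_y)


def estimate_group_focus(
    people: Sequence[Person], width: int, height: int,
    reach: float = 2.0, alpha: float = 0.5,
) -> Point:
    """Fuse the two paths with a fixed weight (stand-in for the fused network)."""
    if not people:
        raise ValueError("need at least one person")
    local = sequential_aggregate(people, reach)
    counts = gaze_flow_map(people, width, height)
    if not counts:
        return local  # no ray enters the grid; fall back to the individual path
    best = max(sorted(counts), key=counts.__getitem__)  # deterministic tie-break
    global_ = (best[0] + 0.5, best[1] + 0.5)  # centre of the peak cell
    return (
        alpha * global_[0] + (1 - alpha) * local[0],
        alpha * global_[1] + (1 - alpha) * local[1],
    )


people = [((0.5, 2.5), (1.0, 0.0)), ((2.5, 0.5), (0.0, 1.0)), ((4.5, 4.5), (-1.0, -1.0))]
print(estimate_group_focus(people, width=5, height=5))  # approximately (2.598, 2.598)
```

The three gaze rays all cross cell (2, 2), so the global path peaks there, while the third person's more distant guess pulls the individual path slightly toward the lower-right; the fused estimate lies between the two.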
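For GLMGIR in contribution (3), the following sketch shows one round of graph message passing that encodes interactions within and between teams, followed by dot-product attention over granularity levels (the whole scene and each team). It is a hypothetical simplification, not the GLMGIR architecture: players are fully connected instead of being linked by a learned or distance-based graph, messages are plain means, the attention query stands in for the state of a sentence decoder, and `encode_interactions`, `granularity_levels`, and `attend` are illustrative names.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def _mean(vectors: Sequence[Vector], dim: int) -> Vector:
    """Element-wise mean; an empty neighbourhood yields a zero message."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_interactions(features: Sequence[Vector], teams: Sequence[str]) -> List[Vector]:
    """Player granularity: [own features | intra-team message | inter-team message]."""
    if len(features) != len(teams) or not features:
        raise ValueError("need one team label per player and at least one player")
    dim = len(features[0])
    encoded: List[Vector] = []
    for i, team in enumerate(teams):
        mates = [features[j] for j in range(len(features)) if j != i and teams[j] == team]
        rivals = [features[j] for j in range(len(features)) if teams[j] != team]
        encoded.append(list(features[i]) + _mean(mates, dim) + _mean(rivals, dim))
    return encoded


def granularity_levels(encoded: Sequence[Vector], teams: Sequence[str]) -> Dict[str, Vector]:
    """Team and scene granularities, pooled from the player-level encodings."""
    dim = len(encoded[0])
    levels = {"scene": _mean(encoded, dim)}
    for team in sorted(set(teams)):
        levels["team:" + team] = _mean([e for e, t in zip(encoded, teams) if t == team], dim)
    return levels


def attend(levels: Dict[str, Vector], query: Vector) -> Vector:
    """Softmax attention over granularities with dot-product scores."""
    names = sorted(levels)
    scores = [sum(q * x for q, x in zip(query, levels[n])) for n in names]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    total = sum(weights)
    dim = len(levels[names[0]])
    return [sum(w * levels[n][i] for w, n in zip(weights, names)) / total for i in range(dim)]


features = [[1.0], [3.0], [10.0], [20.0]]  # one scalar per player (toy data)
teams = ["A", "A", "B", "B"]
encoded = encode_interactions(features, teams)
levels = granularity_levels(encoded, teams)
print(encoded[0])        # [1.0, 3.0, 15.0]
print(levels["team:A"])  # [2.0, 2.0, 15.0]
print(attend(levels, query=[0.0, 0.0, 1.0]))
```

A query aligned with the inter-team slot shifts almost all weight onto team A, whose players face the stronger opponents in this toy data, while an all-zero query weights the three granularities equally.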
Keywords/Search Tags: Visual behavior understanding, gaze, individual, group, scene, asynchronous, spatio-temporal, multi-granularity, group gaze, video captioning, graph-based learning