
The Action Recognition And Gaze Following Based On Multimodality Information

Posted on: 2021-03-17   Degree: Master   Type: Thesis
Country: China   Candidate: C Zhu   Full Text: PDF
GTID: 2518306248986099   Subject: Computer system architecture
Abstract/Summary:
With the rapid development of intelligent products and the Internet, large amounts of data of varying quality are being produced, and most of it is video. Intelligent algorithms offer an efficient, low-cost way to check video content, especially the actions it contains, so that invalid videos can be filtered out or flagged with an early warning. However, current methods recognize action categories from the entire preceding video sequence and cannot infer human intention, although analyzing intention is crucial for recognition and classification. Gaze following is an inherent human ability that helps people interact with their surroundings, yet it remains difficult for a computer to simulate. A computer that could follow gaze would be better able to recognize actions in incomplete videos.

In action recognition, how appearance information and motion information are used are two decisive factors for recognition accuracy. Appearance information can be obtained by sparsely sampling video frames, and optical flow can describe motion. However, most current methods ignore the latent connection between these kinds of information. To address this, this thesis proposes a three-stream network that captures richer information, with each stream taking a different modality as input: sampled RGB frames describe appearance, stacked optical flows describe motion, and a dynamic image, obtained by rank pooling over the RGB frames, describes spatio-temporal information. The three-stream network thus draws on multiple modalities of action information in the video sequence to model an action. Experiments on the UCF101 dataset show that it outperforms previous methods, which the thesis attributes to the richer information it acquires.

Current gaze-following methods split the task into two sub-tasks: saliency detection and gaze estimation. Such methods can detect salient objects in the direction of a person's gaze, but they ignore the relevance among the person, the background, and the objects. Lacking this relevance information, they are more likely to identify large objects than small ones and cannot handle images with complex backgrounds. To address this problem, this thesis proposes a three-stream network that obtains salient objects, a gaze estimate, and the relevance among the person, objects, and background. The saliency stream takes the original image as input, the gaze stream takes the person's head image and position, and the relevance stream takes a relevance matrix containing the spatial relevance among every object, the person, and the background. Each stream outputs a vector representing the features of its modality. Extensive experiments were carried out on the GazeFollow dataset, and the ablation results indicate the importance of relevance for scene understanding. The method achieves state-of-the-art performance on four metrics, showing that the three-stream network can handle gaze following in complex scenes.
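The dynamic image used by the action recognition network can be illustrated with a short sketch. The abstract only states that the dynamic image results from rank pooling over RGB frames and that three streams are combined; it does not give the pooling variant or the fusion rule. The code below therefore assumes the simplified approximate rank pooling weights alpha_t = 2t - T - 1 and a weighted average of per-stream softmax scores. The function names (`rank_pooling_coefficients`, `dynamic_image`, `fuse_streams`, `predict`) are hypothetical, and frames are represented as flat lists of pixel values to keep the example dependency-free.

```python
import math


def rank_pooling_coefficients(num_frames):
    """Simplified approximate rank pooling weights: alpha_t = 2t - T - 1, t = 1..T.

    Assumption: the thesis does not state which rank pooling variant it uses.
    """
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


def dynamic_image(frames):
    """Collapse a clip into one image as a weighted sum of its frames.

    frames: list of T frames, each a flat list of pixel values of equal length.
    Later frames get positive weights and earlier frames negative ones, so
    pixels that do not change over time cancel out.
    """
    if not frames:
        raise ValueError("need at least one frame")
    coeffs = rank_pooling_coefficients(len(frames))
    num_pixels = len(frames[0])
    pooled = [0.0] * num_pixels
    for alpha, frame in zip(coeffs, frames):
        if len(frame) != num_pixels:
            raise ValueError("all frames must have the same number of pixels")
        for i, value in enumerate(frame):
            pooled[i] += alpha * value
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights=None):
    """Late fusion (an assumed rule): weighted average of per-stream softmax scores.

    stream_logits: one list of class logits per stream, e.g. RGB, optical flow,
    and dynamic image.
    """
    if weights is None:
        weights = [1.0] * len(stream_logits)
    total_weight = sum(weights)
    num_classes = len(stream_logits[0])
    fused = [0.0] * num_classes
    for logits, weight in zip(stream_logits, weights):
        for c, p in enumerate(softmax(logits)):
            fused[c] += weight * p / total_weight
    return fused


def predict(stream_logits, weights=None):
    """Return the index of the class with the highest fused score."""
    fused = fuse_streams(stream_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)
```

For a three-frame clip with two pixels, `dynamic_image([[0, 1], [1, 1], [2, 1]])` returns `[4.0, 0.0]`: the pixel that brightens over time is kept, and the static pixel cancels to zero. Averaging softmax scores is only one common late-fusion choice; the thesis may combine its streams differently.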
Keywords/Search Tags: action recognition, gaze following, temporal stream, spatial stream, saliency detection, gaze estimation, relevance information