
Efficient Recognition Of Realistic Human Actions

Posted on: 2013-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q X Wu
Full Text: PDF
GTID: 1118330374476410
Subject: Systems Engineering
Abstract/Summary:
During the past few decades we have witnessed an explosion in the production of video data due to the advancement of information technologies, and automatically understanding video content has become increasingly important for many applications. Since human action is one of the most predominant elements of video content, many significant studies have been carried out in the computer vision domain. However, most existing algorithms focus on datasets acquired in well-controlled settings, which prevents those techniques from being applied in more realistic scenarios. In this thesis, we investigate the problems in this emerging direction, realistic human action recognition, including multi-modal data fusion, discriminative feature selection, and structure-context-based feature extraction.

Although promising results have been achieved for human action recognition under well-controlled conditions, recognition remains very challenging in realistic scenarios due to increased difficulties such as dynamic backgrounds. One of the most important contributions of this dissertation is to take the multi-modal (i.e., audio-visual) characteristics of realistic human action videos into account, since in realistic scenarios the audio signals accompanying an action generally provide a cue to the existence of that action, such as the sound of a phone ringing (phone-ringing) for the action of answering a phone (answer-phone). To cope with the diversity of audio cues for an action in realistic scenarios, we propose to identify effective features from a large number of audio features with Generalized Multiple Kernel Learning (GMKL). At the final stage, a decision-level fusion strategy based on the fuzzy integral is used to combine the recognition results from the audio and visual modalities. Better recognition performance is achieved, and we analyze how audio context influences realistic action recognition.

Space-Time Interest Points (STIPs) have been successfully used for human action recognition. However, in realistic scenarios a large number of interest points are irrelevant to the specific action being represented. In this dissertation, we therefore propose to prune those irrelevant interest points so as to reduce computational cost as well as improve recognition performance. Taking human perception into account, an attention-based saliency map is employed to select the salient interest points that fall into salient regions, since visual saliency provides strong evidence for the location of the acting subjects. We demonstrate that the proposed method is computationally efficient while achieving improved performance on the recognition of realistic human actions.

A large amount of work has been devoted to representing different human actions in diverse realistic scenes. In the bag-of-features model, human actions are generally represented by the distribution of local features derived from the keypoints of action videos, and various local features have been proposed to characterize those keypoints. However, the important structural information among the keypoints has not been well investigated yet. In this dissertation, we also propose to characterize such structural information with shape context, so that each keypoint is described by both its local visual attributes and its global structural context contributed by the other keypoints. The proposed approach, which accounts for structural information, is more effective in representing realistic human actions.
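To make the structural representation concrete, below is a minimal sketch of a 2-D shape-context-style descriptor computed over a set of keypoints: for each keypoint, the relative positions of all other keypoints are binned into a log-polar histogram. The function name shape_context, the bin counts, and the normalization by the mean pairwise distance are illustrative assumptions, not the exact settings used in the dissertation.

import numpy as np

def shape_context(points, n_radial=5, n_angular=12):
    # Simplified sketch: a log-polar histogram of the relative positions of
    # all other keypoints, computed for every keypoint (shape context).
    points = np.asarray(points, dtype=float)
    n = len(points)
    diffs = points[None, :, :] - points[:, None, :]        # pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)
    angles = np.arctan2(diffs[..., 1], diffs[..., 0])      # in [-pi, pi]

    # Log-spaced radial bins, scaled by the mean pairwise distance (assumed choice)
    mean_d = dists[dists > 0].mean()
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radial + 1) * mean_d
    a_edges = np.linspace(-np.pi, np.pi, n_angular + 1)

    descriptors = np.zeros((n, n_radial, n_angular))
    for i in range(n):
        mask = np.arange(n) != i                            # exclude the point itself
        r_bin = np.digitize(dists[i, mask], r_edges) - 1
        a_bin = np.clip(np.digitize(angles[i, mask], a_edges) - 1, 0, n_angular - 1)
        valid = (r_bin >= 0) & (r_bin < n_radial)           # drop points outside the radial range
        np.add.at(descriptors[i], (r_bin[valid], a_bin[valid]), 1)
    return descriptors.reshape(n, -1)                       # one flattened histogram per keypoint

In the thesis each keypoint would additionally carry its local visual descriptor; the structural histogram above is the complementary global part of the representation.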
In addition, we investigate the impact of choosing different local features, such as SIFT, HOG, and HOF descriptors, for human action representation, and observe that dense keypoints better exploit the advantages of the proposed approach.

The contributions of this thesis can be summarized as follows.
(1) We propose to bring audio context into realistic human action recognition, so that multi-modal features of human actions in movies are exploited for better recognition performance. To the best of our knowledge, this is the first study on multi-modal realistic human action recognition.
(2) We adopt the GMKL algorithm to select effective features from a large number of audio features.
(3) We propose to fuse multi-modal features at the decision level with the fuzzy integral for better recognition performance (a minimal sketch of this fusion step follows the list).
(4) We employ an attention-based saliency map to select the salient interest points that fall into salient regions, since visual saliency provides strong evidence for the location of the acting subjects.
(5) We present a novel feature representation approach that characterizes the structural information among keypoints with shape context, so that each keypoint is described by both its local visual attributes and its global structural context contributed by the other keypoints.
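As a concrete illustration of contribution (3), the following is a minimal sketch of decision-level fusion with a Sugeno fuzzy integral: per-modality confidence scores for one action class are combined under a lambda-fuzzy measure whose densities encode how much each modality is trusted. The function names, the bisection-based solver for lambda, and the example densities are illustrative assumptions; the exact fusion operator used in the dissertation may differ.

import numpy as np

def solve_lambda(densities, tol=1e-6):
    # Solve prod(1 + lam * g_i) = 1 + lam for the nonzero root lam > -1
    # (the normalization condition of a Sugeno lambda-measure), by bisection.
    g = np.asarray(densities, dtype=float)
    if abs(g.sum() - 1.0) < tol:
        return 0.0
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    lo, hi = ((-1.0 + tol, -tol) if g.sum() > 1.0 else (tol, 1e6))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sugeno_integral(scores, densities):
    # Fuse per-modality confidence scores for one class with a Sugeno fuzzy
    # integral; 'densities' weigh how much each modality is trusted.
    scores = np.asarray(scores, dtype=float)
    densities = np.asarray(densities, dtype=float)
    lam = solve_lambda(densities)
    order = np.argsort(-scores)             # visit scores in descending order
    g_prev, fused = 0.0, 0.0
    for idx in order:
        g_prev = densities[idx] + g_prev + lam * densities[idx] * g_prev
        fused = max(fused, min(scores[idx], g_prev))
    return fused

# Hypothetical usage: audio and visual classifier confidences for "answer-phone"
print(sugeno_integral([0.8, 0.4], densities=[0.6, 0.5]))

The fused score for each class would then be compared across classes to produce the final action label.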
Keywords/Search Tags:Human action recognition, Multiple kernel learning, Decision level fusion, Space-time interest points, Saliency map, Shape context