
Research On Human Action Recognition Based On Audio-Visual Information

Posted on: 2014-02-23    Degree: Master    Type: Thesis
Country: China    Candidate: S K Zhou    Full Text: PDF
GTID: 2248330398460773    Subject: Control Science and Engineering

Abstract/Summary:
Human action recognition has been an active research field due to its broad application prospects in human-computer interaction (HCI), robotics, content-based video indexing, video surveillance, and so on. It is a very challenging problem because of the complexity of natural environments (cluttered backgrounds, varying illumination) and the variability of the same action when performed by different agents with different body contours, clothing, manners, and speeds, observed from different viewpoints. The key question is therefore how to extract effective features from the huge amount of information in videos or images; that is, how to use simple, fast, and effective models to represent the different behaviors of the human body in complex natural environments while meeting real-time and robustness requirements.

This dissertation studies human behavior recognition with a focus on local spatio-temporal features of audio-visual information. Building on an analysis of a large number of human behavior recognition algorithms, the dissertation applies a 3D Harris corner detector to detect spatio-temporal interest points in video clips from the KTH action data set and uses a 3D SIFT descriptor to describe those points, which suffices to classify six simple human behavior classes. For the YouTube and HOHA data sets, with their complex dynamic backgrounds, we propose to take full advantage of the information in the video, i.e. visual, motion, and audio information, to represent a person's behavior discriminatively. The major work is as follows.

Firstly, the thesis introduces the background and significance of human behavior recognition, reviews the state of the art of vision-based human behavior recognition and its main open problems, and presents the main work and framework of the thesis.

Secondly, existing methods for feature extraction and description, and models for describing human behavior based on visual information, are briefly analyzed. Common action data sets are also described.

Thirdly, human behavior recognition is studied for the case of a simple, unoccluded background. After analyzing and comparing various visual detectors and descriptors, local spatio-temporal interest points are first detected with the 3D Harris corner detector; the 3D SIFT descriptor vectors around these interest points are then computed and quantized into visual words, so that an action video is represented by the statistical distribution of its bag of visual words (BOV). On top of this representation, an incrementally learning classifier for human action recognition is designed with the Online Sequential Extreme Learning Machine (OS-ELM). Because it can absorb sequentially arriving data chunk by chunk, with fixed or varying chunk sizes, the OS-ELM classifier achieves improved recognition performance: in a comparison with the batch ELM and the support vector machine (SVM) on the KTH data set, OS-ELM proves superior for human action recognition. The algorithm is relatively simple and is appropriate for settings with a relatively simple, unobstructed background; a minimal sketch of this pipeline is given below.
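The thesis text itself gives no code; what follows is only a minimal sketch, in Python with NumPy and scikit-learn, of the BOV representation and the standard OS-ELM recursive update described above. The random descriptor matrices, the vocabulary size K, and the hidden-layer size L are illustrative assumptions, and the 3D Harris / 3D SIFT extraction is assumed to have been done elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Bag of visual words (BOV) ---------------------------------------
# Assume each video yields an (n_i x 128) matrix of 3D SIFT descriptors
# computed around its 3D Harris interest points (extraction not shown;
# random data stands in for real descriptors here).
rng = np.random.default_rng(0)
videos = [rng.normal(size=(int(rng.integers(50, 200)), 128)) for _ in range(30)]
labels = rng.integers(0, 6, size=30)            # six KTH action classes

K = 100                                          # vocabulary size (assumed)
vocab = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(videos))

def bov_histogram(desc):
    """Quantize descriptors into visual words; return a normalized histogram."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()

X = np.array([bov_histogram(v) for v in videos])
T = np.eye(6)[labels]                            # one-hot class targets

# --- OS-ELM -----------------------------------------------------------
# Recursive least-squares update of the output weights (Liang et al., 2006).
L = 50                                           # hidden neurons (assumed)
W = rng.normal(size=(K, L))                      # random input weights, kept fixed
b = rng.normal(size=L)                           # random biases, kept fixed

def hidden(F):
    return 1.0 / (1.0 + np.exp(-(F @ W + b)))    # sigmoid hidden layer

H0, T0 = hidden(X[:20]), T[:20]                  # initial training chunk
P = np.linalg.inv(H0.T @ H0 + 1e-6 * np.eye(L))  # regularized for invertibility
beta = P @ H0.T @ T0

def oselm_update(Xc, Tc):
    """Absorb one new chunk without retraining on the old data."""
    global P, beta
    H = hidden(Xc)
    S = np.linalg.inv(np.eye(len(Xc)) + H @ P @ H.T)
    P = P - P @ H.T @ S @ H @ P
    beta = beta + P @ H.T @ (Tc - H @ beta)

oselm_update(X[20:], T[20:])                     # a later chunk arrives
pred = hidden(X) @ beta
print("training accuracy:", np.mean(pred.argmax(axis=1) == labels))
```

The sequential update lets the classifier absorb new action videos chunk by chunk without retraining from scratch, which is what makes OS-ELM attractive when data arrive incrementally.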
Fourthly, human behavior recognition with complex backgrounds and occlusions is studied further. We propose to make full use of the information in the video, fusing visual, audio, and motion information for human behavior recognition. For visual information, the Cuboid detector is used to detect interest points in the region of interest of the video, and the interest points are then described per video block with the LBP-TOP descriptor. Human motion features are described by the Tracklet descriptor. For audio information, we extract 14 kinds of audio features in the frequency and time domains. We then apply feature-level fusion, decision-level fusion, and a mixed fusion method on the HOHA and YouTube data sets to fuse the three feature types and recognize human behavior against complex dynamic backgrounds. Experiments show that the proposed method achieves better human action recognition performance in complex dynamic environments; a minimal decision-level fusion sketch is given below. Lastly, IPCA-ELM is proposed to improve the classification performance of the ELM by optimizing its input weights and biases. Compared with GSO (Group Search Optimization) and PSO (Particle Swarm Optimization), the immune polyclonal algorithm does not suffer from the local-extremum problem and can handle multi-peak (multimodal) optimization problems; experiments confirm that an ELM optimized by this algorithm effectively improves recognition performance (a hedged sketch of a clonal-selection-style search follows the fusion sketch).

Finally, conclusions are drawn and recommendations for future work are given.
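To make the decision-level fusion step concrete, here is a minimal sketch that combines per-modality classifier scores by a weighted sum. The score matrices and the modality weights are assumptions; the thesis's feature-level and mixed fusion variants are not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
n_videos, n_classes = 20, 8

# Assumed per-modality posterior scores from three separately trained
# classifiers: visual (LBP-TOP over Cuboid points), motion (Tracklet),
# and audio (time/frequency-domain features). Shape: (n_videos, n_classes).
scores = {
    "visual": rng.dirichlet(np.ones(n_classes), size=n_videos),
    "motion": rng.dirichlet(np.ones(n_classes), size=n_videos),
    "audio":  rng.dirichlet(np.ones(n_classes), size=n_videos),
}

# Decision-level fusion: weighted sum of the modality scores.
# Weights are illustrative; in practice they are tuned on validation data.
weights = {"visual": 0.5, "motion": 0.3, "audio": 0.2}

fused = sum(w * scores[m] for m, w in weights.items())
prediction = fused.argmax(axis=1)        # one class label per video
print(prediction)
```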
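The abstract does not specify the operators of the immune polyclonal algorithm, so the sketch below shows only a generic clonal-selection-style search (clone the best candidates, mutate the clones, keep the fittest) over the ELM's random input weights and biases, standing in for IPCA; all sizes and mutation rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data standing in for the fused action features (assumed shapes).
X = rng.normal(size=(60, 10))
y = rng.integers(0, 3, size=60)
T = np.eye(3)[y]
L = 20                                   # hidden neurons (assumed)

def elm_fitness(W, b):
    """Train an ELM with the given input weights/biases; return accuracy."""
    H = np.tanh(X @ W + b)
    beta = np.linalg.pinv(H) @ T         # least-squares output weights
    return np.mean((H @ beta).argmax(axis=1) == y)

# Generic clonal-selection loop: clone the best antibodies, hypermutate
# the clones, and keep the fittest. (The thesis's IPCA may differ in detail.)
pop = [(rng.normal(size=(10, L)), rng.normal(size=L)) for _ in range(10)]
for generation in range(20):
    pop.sort(key=lambda ab: -elm_fitness(*ab))
    clones = []
    for W, b in pop[:3]:                 # clone the 3 best antibodies
        for _ in range(3):               # mutate each clone
            clones.append((W + 0.1 * rng.normal(size=W.shape),
                           b + 0.1 * rng.normal(size=b.shape)))
    pop = pop[:3] + clones               # survivors for the next iteration

best_W, best_b = max(pop, key=lambda ab: elm_fitness(*ab))
print("best training accuracy:", elm_fitness(best_W, best_b))
```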
Keywords/Search Tags: Human action recognition, Visual feature, Bag of visual words, Extreme Learning Machine, Audio features, Motion feature, Information fusion, Immune polyclonal algorithms