| Along with recent breakthroughs in deep learning, behavior recognition based on computer vision has received widespread attention and made considerable progress. It has broad application prospects in fields such as security monitoring, medical monitoring, human-computer interaction, autonomous driving, and unmanned shops. At present, most behavior recognition methods can recognize only the behavior of a single person and only a limited number of behavior categories, such as walking, running, and falling; they cannot detect the many interactions between humans and environmental objects in a scene. In complex scenes with drastic background changes, methods that rely on hand-crafted features are usually not robust to environmental changes, object deformation, and occlusion, which leads to low recognition accuracy. In addition, because the image data to be processed carries a large amount of information, most current vision-based behavior recognition methods have high computational complexity and cannot run in real time. To address these problems, the main contributions of this thesis are as follows: (1) For behavior recognition in videos, a Long-Short Term Spatio-Temporal Visual Model (LSTVM) that combines a three-dimensional convolutional neural network with a recurrent neural network is proposed. The method uses the three-dimensional convolutional neural network to extract generic short-term spatio-temporal visual features and then feeds these features into an improved recurrent neural network to extract specific long-term behavioral features. Experimental results show that LSTVM achieves 87.6% accuracy on the UCF101 dataset. (2) To improve the accuracy of interactive behavior recognition in videos, the work in (1) is extended and a Long-Short Term Spatio-Temporal Visual Model with Human-Object Visual Relationship (HOVR-LSTVM) is proposed. The method uses an object detector based on a convolutional neural network to obtain the semantic and spatial locational information of humans and objects, and then constructs semantic-spatial locational features that are fused with the short-term spatio-temporal visual features. Experimental results show that HOVR-LSTVM raises the accuracy on the UCF101 dataset to 92.5%, outperforming other state-of-the-art methods. In addition, HOVR-LSTVM has lower computational complexity than methods based on optical flow and runs at 125.2 frames/sec, achieving faster-than-real-time recognition. (3) For human-object interaction detection, a Visual-Semantic Model with Attention Mechanism (VSM-AM) is proposed to detect multiple human-object interactions in an image simultaneously. The method has three parts. First, an object detector based on a convolutional neural network is used to obtain the semantic and spatial locational information of humans and objects, and a 3-channel spatial locational pattern is proposed to construct human-object spatial locational features. Second, a convolutional neural network is used to extract generic visual features of humans and objects, and an Attention Network (AN) is proposed to construct the spatial visual features. Third, a word embedding method is used to encode the semantic information of objects into semantic features, and an action classifier that fuses these semantic features is proposed to classify the interaction behavior. Experimental results show that VSM-AM achieves a mean average precision of 21.30% and a Top-3 recall rate of 56.9% on the HICO-DET dataset, outperforming other state-of-the-art methods. In addition, VSM-AM runs at 7.8 frames/sec, achieving real-time detection performance. |
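
The two-stage data flow of LSTVM described in contribution (1) can be illustrated with a minimal sketch. It shows only the structure: the video is cut into short clips, each clip is reduced to a short-term feature vector, and the sequence of clip features is folded into one long-term descriptor by a recurrent update. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the function names, the clip length, the per-clip averaging that stands in for the three-dimensional convolutional network, and the simple `tanh` recurrence that stands in for the improved recurrent network.

```python
import math
from typing import List

Frame = List[float]  # hypothetical: one frame flattened to a list of pixel values


def split_into_clips(frames: List[Frame], clip_len: int) -> List[List[Frame]]:
    """Cut the video into consecutive, non-overlapping clips of clip_len frames.

    Trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def short_term_features(clip: List[Frame]) -> List[float]:
    """Stand-in for the 3D CNN: average each pixel position over the clip."""
    n_frames = len(clip)
    n_pixels = len(clip[0])
    return [sum(frame[j] for frame in clip) / n_frames for j in range(n_pixels)]


def long_term_features(clip_feats: List[List[float]], decay: float = 0.5) -> List[float]:
    """Stand-in for the recurrent network: h_t = tanh(decay * h_{t-1} + x_t)."""
    if not clip_feats:
        raise ValueError("need at least one clip feature vector")
    hidden = [0.0] * len(clip_feats[0])
    for x in clip_feats:
        hidden = [math.tanh(decay * h + xi) for h, xi in zip(hidden, x)]
    return hidden


def video_descriptor(frames: List[Frame], clip_len: int = 4) -> List[float]:
    """Short-term features per clip, then long-term aggregation over clips."""
    clips = split_into_clips(frames, clip_len)
    feats = [short_term_features(clip) for clip in clips]
    return long_term_features(feats)


# Toy video: 10 frames of 2 "pixels" each (assumed data, for illustration only).
frames = [[float(t), 0.0] for t in range(10)]
descriptor = video_descriptor(frames, clip_len=4)
print(descriptor)
```

In a real system the two stand-ins would be replaced by learned networks and the final descriptor would be passed to a classifier; the sketch is meant only to make the short-term/long-term split concrete.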