| With the development of deep learning, applying it to improve everyday life has become a research focus. Real train carriages are usually crowded, so passengers’ luggage is frequently taken by mistake or stolen, and human staff alone cannot catch every such incident. Taking the behavior of picking up or taking away luggage as the research perspective, it is therefore of great significance and practical value to study action recognition algorithms for travel scenarios. This paper collects and annotates data that simulate real carriages, and designs and trains a series of high-accuracy, real-time models on these data. Specifically, the predefined interaction point of the human-object interaction detection algorithm PPDM is not suitable for detecting luggage being picked up or taken away, so this paper modifies the location of the interaction point to suit the task and, inspired by other methods, adds a pose branch to strengthen the new network’s ability to detect the interaction point. Furthermore, to improve human-object interaction detection for this behavior, this paper isolates the two components that dominate its performance, object detection and pose estimation, and separately explores the factors that influence each: annotation choices and data augmentation in object detection, and manual feature design and machine feature learning in pose estimation. Finally, this paper explores how model fusion influences the final performance. The experimental results show that the task-specific modification of the interaction point’s predefined location and the extra pose branch improve classification precision and recall compared with the original method; object detection enhanced by data augmentation and pose information processed by a multi-layer perceptron both achieve precision and recall above 90%. On this basis, model fusion that combines the outputs of object detection and pose estimation reaches the best recognition accuracy among the methods in this paper, at the cost of only slightly more inference time. |