The checkout is a critical location in stores and supermarkets. Detecting the interactions among cashiers, customers, and related objects during the checkout process is particularly important for the standardized management of employees and the security of funds. Deep learning has developed rapidly, moving beyond theoretical research into applications that serve everyday life. In computer vision, human-object interaction (HOI) detection must mine deeper semantic information from images so that the interaction between humans and objects can be analyzed more directly, which gives the task greater practical research value. Applying HOI detection to the checkout scene can therefore provide an intuitive and effective reference for enterprises adjusting their business strategies. However, current HOI detection methods still leave considerable room for improvement through the combination of related deep learning techniques.

This thesis uses feature fusion, human pose, attention mechanisms, and the Visual Transformer to study HOI detection algorithms at the checkout, and proposes several methods. The main work is as follows.

First, from the perspective of spatial position relationships, spatial position features are extracted for humans and objects whose regions overlap in the image. These features serve as an attention function that enhances the fused features of the global context, humans, and objects. They are also used to compute a score indicating whether a human and an object interact, and to predict the probability of a particular interaction behavior during checkout. On this basis, a Multiple Feature Fusion for Checkout Human Object Interaction (MFF-CHOI) model is constructed. In addition, the spatially refined features in MFF-CHOI are used as the input of CDN (Cascade Disentangling Network) for HOI detection, the position encoding of the Visual Feature Extractor Module is modified, the Human Object Pair and Interaction Module is optimized, and the loss function is adjusted to account for the added cashier and customer classification. The resulting MFF-CDN model achieves a better detection effect.

Second, the key points of the human body are detected with the PyraNet model, and fine-grained pose features of the human body joints are extracted on this basis. These fine-grained pose features are then enhanced by an attention mechanism to obtain the FP-CBAM module (Fine-grained Pose for Convolutional Block Attention Module). At the same time, a relative position encoding method is used to optimize the multi-head attention model, and FPPT-CHOI, a parallel dual-branch structure based on the Visual Transformer, is built for human-object pair detection and interaction detection. Finally, HO Pointers are used to match the two branches. Detection speed and accuracy are both improved to a certain extent.

Third, a corresponding intelligent system is designed and implemented based on the proposed FPPT-CHOI model. The system architecture consists of a hardware base layer, a data layer, a service layer, and an application layer. The system functions comprise a data management module and an intelligent analysis module, which are mainly used for labeling and managing checkout data and for the visual display of detection results.

Finally, a Checkout Human Object Interaction Dataset (CHOID) is constructed for different checkout scenarios. The dataset distinguishes cashiers from customers, contains 18 types of human-object interaction in total, covers images taken at different locations, under different lighting conditions, and from different angles, and includes small targets such as receipts and bank cards, so it can verify the actual effect of the algorithm in complex scenarios.
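The spatial position features described above, extracted only for human-object pairs whose regions overlap, can be illustrated with a minimal sketch. The abstract does not specify the exact form of these features, so everything below is an assumption made for illustration: the function names, the `(x1, y1, x2, y2)` box format, and the choice of a five-element vector (normalized center offsets, log size ratios, and intersection over union) are hypothetical and are not the thesis's actual implementation.

```python
import math


def intersection_area(a, b):
    """Area of overlap between two boxes given as (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def box_area(box):
    """Area of a single (x1, y1, x2, y2) box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def spatial_features(human, obj):
    """Return a hypothetical spatial feature vector for one human-object pair.

    Pairs without an overlapping region are skipped (None is returned),
    mirroring the abstract's restriction to overlapping humans and objects.
    """
    inter = intersection_area(human, obj)
    if inter <= 0.0:
        return None

    human_w, human_h = human[2] - human[0], human[3] - human[1]
    obj_w, obj_h = obj[2] - obj[0], obj[3] - obj[1]
    human_cx, human_cy = (human[0] + human[2]) / 2, (human[1] + human[3]) / 2
    obj_cx, obj_cy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2

    iou = inter / (box_area(human) + box_area(obj) - inter)
    return [
        (obj_cx - human_cx) / human_w,  # horizontal offset, in human widths
        (obj_cy - human_cy) / human_h,  # vertical offset, in human heights
        math.log(obj_w / human_w),      # relative width (log scale)
        math.log(obj_h / human_h),      # relative height (log scale)
        iou,                            # degree of overlap
    ]


if __name__ == "__main__":
    cashier = (0.0, 0.0, 10.0, 20.0)
    receipt = (5.0, 5.0, 15.0, 15.0)
    print(spatial_features(cashier, receipt))
```

In a model such as the one the abstract describes, a vector like this would typically be passed through a small learned layer to produce attention weights over the fused appearance features; that step is omitted here because the abstract gives no detail on it.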