The mechanical manufacturing industry is a cornerstone of China's industrial sector. With changing market demands, the industry urgently needs to shift from large-scale standardized production to customized, multi-variety, small-batch production. Human-computer interaction and deep-learning-based artificial intelligence technology provide the possibility and the means for this transition. In human-robot collaborative manufacturing, intelligent robots detect and understand human behavior and intentions through action recognition, enabling highly customized and automated production. This article establishes a human behavior recognition algorithm based on video streams to recognize human movements, and verifies its performance both on public datasets and in real-world scenarios. The algorithm consists of three parts: a human target detection model, a human pose estimation model, and a human action recognition model.

Based on the YOLOv4 network, the ABYOLOv4 (ASPP + Bi-FPN + YOLOv4) human target detection model was constructed to locate human targets in the image. First, to adapt YOLOv4 to the human detection task, the multi-class detection model was simplified to a single-class model that detects only humans. Then, to address the low detection accuracy and frequent missed detections of medium- and small-scale human targets in complex visual scenes, an ASPP module was introduced into YOLOv4 and an additional middle-layer convolutional input was added to build a double Bi-FPN (a minimal ASPP sketch is given after the model descriptions). Evaluation on public datasets shows that the model achieves higher accuracy with a smaller model size, striking a balance between the two.

Based on the TransPose network, the VTTransPose (V block + twin attention + TransPose) human pose estimation model was constructed to estimate the coordinates of human keypoints within the detected region. First, a sparse representation of the self-attention mechanism was introduced into TransPose to reduce computational cost and improve network efficiency (a generic sparse-attention sketch is also given below). Then, an intra-layer feature fusion module, the V block, was constructed to strengthen the network's ability to localize keypoints. Evaluation on public datasets shows that VTTransPose achieves higher detection accuracy with a smaller model size and lower computational cost, and can locate human keypoints accurately.

Based on the PoseC3D network, the TPoseC3D (TPN + PoseC3D) human action recognition model was constructed to generate stacked three-dimensional keypoint heat maps and perform human action recognition. To address the difficulty of distinguishing actions with similar temporal rates, a Temporal Pyramid Network (TPN) was inserted between the backbone and the prediction head of the original PoseC3D to fuse features of actions with different visual tempos and enhance the network's ability to discriminate between actions (see the temporal pyramid sketch below). The experimental results show that TPoseC3D performs well on human action recognition tasks.
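The exact ASPP configuration used in ABYOLOv4 is not specified in this summary, so the following PyTorch sketch shows a generic ASPP module in the common DeepLab style; the dilation rates (1, 6, 12, 18), the image-level pooling branch, and the channel sizes in the usage example are illustrative assumptions rather than details of the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions sample
    context at several receptive-field sizes, which helps with human targets
    of very different scales."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):  # assumed rates
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch,
                          kernel_size=1 if r == 1 else 3,
                          padding=0 if r == 1 else r,
                          dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Image-level branch: global pooling injects scene-wide context.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))

# Example: a 13x13 backbone feature map, as in YOLO-style detectors.
y = ASPP(512, 256)(torch.randn(2, 512, 13, 13))  # -> (2, 256, 13, 13)
```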
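The summary names a "twin attention" sparsification of TransPose's self-attention but does not define it. As a stand-in, the sketch below illustrates the general principle of sparse attention with simple non-overlapping window attention, which is not necessarily the paper's mechanism; the window size and head count are placeholders.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of tokens.

    Dense attention over N tokens costs O(N^2); attending only within
    windows of size w reduces this to roughly O(N * w)."""

    def __init__(self, dim, num_heads=8, window=49):  # placeholder values
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, C); N must be divisible by the window size.
        b, n, c = tokens.shape
        w = self.window
        x = tokens.reshape(b * (n // w), w, c)  # fold each window into the batch
        out, _ = self.attn(x, x, x)             # dense attention inside windows only
        return out.reshape(b, n, c)

# Example: 196 tokens (a flattened 14x14 feature map), 256 channels.
out = WindowSelfAttention(256)(torch.randn(2, 196, 256))  # -> (2, 196, 256)
```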
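A temporal pyramid fuses features pooled at several temporal rates so that fast and slow performances of an action produce comparable representations. The sketch below is a simplified, single-stage version under assumed tensor shapes; the published TPN additionally aggregates features from multiple backbone stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramidFusion(nn.Module):
    """Pools clip features at several temporal rates, stretches the slow
    branches back to the original clip length, and fuses all branches, so
    that actions performed at different speeds yield comparable features."""

    def __init__(self, channels, rates=(1, 2, 4)):  # assumed pyramid rates
        super().__init__()
        self.rates = rates
        self.fuse = nn.Conv3d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, H, W) spatio-temporal features from the 3D backbone.
        t, h, w = x.shape[2:]
        levels = []
        for r in self.rates:
            y = x if r == 1 else F.max_pool3d(x, (r, 1, 1), stride=(r, 1, 1))
            if r > 1:  # re-align the subsampled branch to T frames
                y = F.interpolate(y, size=(t, h, w), mode="trilinear",
                                  align_corners=False)
            levels.append(y)
        return self.fuse(torch.cat(levels, dim=1))

# Example: a 48-frame clip of 64-channel features at 16x16 resolution.
feats = TemporalPyramidFusion(64)(torch.randn(2, 64, 48, 16, 16))
```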
By combining the three models above, experiments were carried out in real scenarios: ten action categories were recorded with cameras in different scenes. The results show that the ABYOLOv4 human target detection model has a good overall detection effect and is not easily affected by changes in human scale, although missed detections occur when humans overlap over a large area. The VTTransPose human pose estimation model handles changes in human scale and viewing angle as well as slight occlusion well, with strong robustness; however, under large-area occlusion the predicted keypoints become inaccurate and their positions fluctuate. The TPoseC3D human action recognition network achieves high recognition accuracy for actions involving large limb movements and can recognize action categories correctly even when some historical information is lost, showing strong robustness.
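To make the interface between the pose estimation stage and the action recognition stage concrete, the sketch below shows one way to render per-frame keypoints into the stacked three-dimensional heatmap volume that PoseC3D-style networks consume. The heatmap resolution, Gaussian sigma, 17-keypoint layout, and the omission of per-keypoint confidence weighting are simplifying assumptions.

```python
import torch

def stack_keypoint_heatmaps(keypoints_per_frame, heatmap_size=(64, 64), sigma=2.0):
    """Renders per-frame keypoint coordinates as Gaussian heatmaps and stacks
    them into a (K, T, H, W) volume, the kind of input a PoseC3D-style
    network consumes.

    keypoints_per_frame: list of T tensors of shape (K, 2) holding (x, y)
    coordinates normalized to [0, 1].
    """
    h, w = heatmap_size
    t = len(keypoints_per_frame)
    k = keypoints_per_frame[0].shape[0]
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    volume = torch.zeros(k, t, h, w)
    for ti, kps in enumerate(keypoints_per_frame):
        for ki, (x, y) in enumerate(kps):
            cx, cy = float(x) * (w - 1), float(y) * (h - 1)
            # 2D Gaussian centered on the keypoint, broadcast over (H, W).
            volume[ki, ti] = torch.exp(
                -((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2)
            )
    return volume

# Example: 17 COCO-style keypoints over a 48-frame clip.
clip = [torch.rand(17, 2) for _ in range(48)]
heatmaps = stack_keypoint_heatmaps(clip)  # shape (17, 48, 64, 64)
```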