| Behavior analysis of specific targets in video surveillance can make law enforcement personnel more efficient at reviewing and assessing video, which is of great significance for law enforcement work. Current mainstream behavior analysis methods focus on everyday scenes, and no method is designed specifically for law enforcement scenes. Moreover, existing algorithms have poor robustness and provide limited analysis content. To address these issues, this paper studies text-guided target segmentation and behavior analysis, and the main innovations and work are as follows. A text-guided target segmentation method is proposed. Based on the RefVOS network architecture, multi-scale image features and text features are first extracted using Swin Transformer and BERT, respectively. Then, the multi-scale image features and text features are fused by multiplication to obtain multi-scale cross-modal features. Finally, convolutional long short-term memory networks aggregate the multi-scale cross-modal features into segmentation feature masks, which are then upsampled to obtain the target segmentation results. The proposed method is trained and tested on the UNC, UNC+, G-Ref, and ReferIt datasets. The experimental results show that, compared with RefVOS, the proposed method improves IoU by 0.92% and 4.1% on the val and test B splits of the UNC dataset, respectively, and by 1.83%, 0.63%, and 1.75% on the val, test A, and test B splits of the UNC+ dataset, respectively. The IoU on the G-Ref and ReferIt datasets reaches 40.16% and 64.37%, respectively. A graph network-based behavior recognition method is proposed. First, a graph network with residual structures is used to extract two-dimensional skeleton keypoint features with stronger representation ability. Then, a parallel temporal analysis module performs temporal analysis on the skeleton keypoint features to obtain action features, which 
improves the efficiency of analyzing long-range temporal skeleton features. Finally, the action feature vector is fed into a softmax classifier to obtain the behavior recognition results. The proposed method is trained and tested on the Kinetics and NTU-RGB+D datasets. The experimental results show that it achieves Top-1 and Top-5 classification accuracy of 32.6% and 55.4%, respectively, on the Kinetics dataset, which is 1.9% and 2.6% higher than the original ST-GCN method, and X-Sub and X-View accuracy of 83.1% and 89.7%, respectively, on the NTU-RGB+D dataset, which is 1.6% and 1.4% higher than the original method. A Transformer-based event description method is proposed. First, a spatiotemporal difference action feature extraction module is designed to extract video action features more efficiently and to characterize them more effectively. Then, an event correlation module is designed using an attention map network to analyze the relationships between different events in long videos and achieve accurate event localization. Finally, an abnormal behavior dataset is constructed to supplement the original dataset and improve the model’s ability to describe abnormal behavior. Simulation experiments are conducted on the ActivityNet and abnormal behavior datasets. The experimental results show that the proposed method improves the BLEU4/METEOR/SODA_c scores by 0.06/0.10/0.63 on the ActivityNet dataset and achieves BLEU4/METEOR/CIDEr/SODA_c scores of 1.08/5.91/15.32/3.6 on the abnormal behavior test set. Finally, the proposed methods are implemented and the above models are integrated into a single system, whose functionality is verified in a simulated real-world scenario. The verification results show that the proposed method can accurately locate specific targets described by text in law enforcement scenes and generate rich analysis results for the target’s behavior over a period of time. |
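
The multiplicative fusion step in the segmentation method can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the text feature has already been projected to the same channel count as each image feature map and that fusion is a channel-wise product broadcast over all spatial positions. The names `fuse_multiplicative` and `fuse_multiscale` are hypothetical, and plain nested lists stand in for tensors.

```python
from typing import List

# One feature map, indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def fuse_multiplicative(image_feat: FeatureMap, text_feat: List[float]) -> FeatureMap:
    """Scale every spatial position of channel c by text_feat[c].

    Assumption: the text embedding has one component per image channel.
    """
    if len(image_feat) != len(text_feat):
        raise ValueError("image channels and text dimensions must match")
    return [
        [[value * weight for value in row] for row in channel]
        for channel, weight in zip(image_feat, text_feat)
    ]


def fuse_multiscale(pyramid: List[FeatureMap], text_feat: List[float]) -> List[FeatureMap]:
    """Apply the same channel-wise product at every scale of the feature pyramid."""
    return [fuse_multiplicative(level, text_feat) for level in pyramid]


# Toy two-level pyramid with two channels per level (2x2 and 1x1 spatial sizes).
pyramid = [
    [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]],
    [[[3.0]], [[4.0]]],
]
text = [2.0, 0.5]
fused = fuse_multiscale(pyramid, text)
```

In a real network, each scale would typically have its own learned projection of the text feature, and the fused maps would then be passed to the aggregation stage (convolutional LSTMs, per the abstract).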