| Temporal action localization requires not only identifying the categories of the actions occurring in a video, but also locating the start and end times of those actions. Deep learning-based temporal action localization can be divided into fully supervised and weakly supervised action localization. Fully supervised action localization requires both the category and the temporal boundaries of each action instance during model training, while weakly supervised action localization requires only video-level action category labels during training, which saves manual labeling cost and reduces labeling errors. Because frame-level labels are unavailable, weakly supervised action localization is usually combined with an attention mechanism to improve localization accuracy. However, two problems are common: 1) Since deep neural networks combined with attention mechanisms can recognize only the highly discriminative action regions in a video, the model usually fails to identify regions where action features are not obvious, leading to incomplete action localization. 2) Traditional attention-based weakly supervised action localization distinguishes action frames from background frames by modeling the action and the background in the video; however, videos also contain context frames, which are related to the action category and can easily cause action-context confusion, leading to inaccurate action localization. To address these two problems, this paper studies weakly supervised action localization based on attention mechanisms, and the specific work is as follows. (1) A weakly supervised action localization model based on attention mechanism (Weakly Supervised Action Localization based on Attention Mechanism, WSAL-AM) was proposed. First, an attention mechanism is used to model the action-background
relationship and extract action and background attention for the separation of action frames and background frames. Subsequently, semi-soft thresholding is applied to the action attention to extract semi-soft attention maps, which guide the model to identify video frames with obscure action features. Finally, post-processing is performed by thresholding the action attention to complete the action localization. On the THUMOS14 and ActivityNet1.3 public datasets, the proposed model achieves a mean average precision (mAP) of 30.8% and 37.0%, respectively, at an intersection over union (IoU) threshold of 0.5, outperforming the other weakly supervised action localization models in the comparison. (2) A weakly supervised action localization model fused with attention model and generative model (Weakly Supervised Action Localization fused with Attention Model and Generative Model, WSAL-AM-GM) was proposed. First, an attention model is used to model the context frames in the video and to extract the attention scores of the context frames, so as to separate action frames from context frames. Then, to optimize the value distribution of the attention model, the two-stream video features and the context attention scores are input into a conditional variational autoencoder. Finally, post-processing is performed by thresholding the action attention to complete the action localization. In comparison experiments against other weakly supervised action localization models on the THUMOS14 and ActivityNet1.3 public datasets, the proposed model achieves mAP@0.5 of 32.6% and 38.6%, respectively. To verify the effectiveness of the proposed model for localizing inspection behavior in power distribution rooms, experiments are conducted on a self-built power distribution room inspection behavior dataset, where mAP@0.5 reaches 30.8%, which is 6.8 percentage points higher than other power distribution room inspection behavior detection models. (3) A temporal action localization system based on deep learning was developed. The system was
designed on a browser-server architecture and consisted of two parts: a web service and an algorithm service. The algorithm service encapsulated the action localization model through the Django framework. The web service was developed with the LayUI framework for the front end and the SpringBoot framework for the back end, in order to visualize the action localization results. |
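
The thresholding post-processing step and the IoU criterion mentioned in the abstract can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the implementation used in this work: the function names `attention_to_segments` and `temporal_iou`, the threshold value, the frame rate, and the example scores are all hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start time, end time) in seconds


def attention_to_segments(attention: List[float],
                          threshold: float,
                          fps: float = 1.0) -> List[Segment]:
    """Group consecutive frames whose attention exceeds `threshold`
    into (start, end) segments. Hypothetical helper for illustration."""
    segments: List[Segment] = []
    start = None  # index of the first frame of the current segment
    for i, score in enumerate(attention):
        if score > threshold and start is None:
            start = i  # a new segment begins
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))  # segment ends before frame i
            start = None
    if start is not None:  # the video ends while a segment is still open
        segments.append((start / fps, len(attention) / fps))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Assumed per-frame action attention scores (one frame per second).
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.6]
predicted = attention_to_segments(scores, threshold=0.5)
print(predicted)
print(temporal_iou(predicted[0], (3.0, 6.0)))
```

Under the mAP@0.5 criterion, a predicted segment counts as a correct detection only if its temporal IoU with a ground-truth instance of the same category is at least 0.5.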