Weakly supervised temporal action detection aims to locate the start and end boundaries of action instances in video and to identify their categories. Existing algorithms still suffer from two problems. First, owing to the lack of fine-grained temporal labels, models capture only the most discriminative regions, resulting in incomplete localization. Second, because of the high similarity between the target action and its context (i.e., video clips bearing certain relations to the target action), the localization interval becomes excessively long and wrongly includes redundant video frames, a problem referred to as context confusion. To address these two problems, three algorithms are proposed in parallel. As preprocessing, an input video with video-level labels is divided into multiple clips at equal intervals; the three algorithms then proceed as follows.

First, an algorithm based on a complementary adversarial mechanism is proposed. In the adversarial stage, the boundary regression process strips away the overlapping regions produced when actions are localized in adjacent clips. This prevents the inclusion of weakly correlated video frames and enhances the exclusivity and adversarial nature of the content of different clips. In the complementary learning stage, the action proposal segments generated by the adversarial regression are spliced into a reconstructed video, and the similarity between the original and reconstructed videos is measured. The resulting error is fed back to the adversarial module to keep it from excessively rejecting valid frames. The two branches coordinate with each other until a balance is reached, so that through these iterations both the completeness and the accuracy of action proposal localization are ensured.

Second, we propose an algorithm based on self-attention relationship modeling and context suppression. Self-attention modeling is used to extract clip features and obtain more discriminative category scores, and a top-k
strategy integrates the category scores to achieve localization completeness. An auxiliary context class is added to learn the latent difference between action and context, which suppresses contextual interference and strips away redundant frames.

Finally, following the idea that "different paths lead to the same destination", a collaborative algorithm combining temporal modeling and modal reinforcement is proposed. The temporal modeling branch uses global visual perception to sharpen the discrimination of segment feature representations and improve detection ability. The sparse graph constructed by the modal-reinforcement branch focuses on learning motion feature representations from the optical-flow modality and also models semantic relationships between segments, highlighting the feature representation of action regions. A cooperative loss function constrains the two branches to converge toward the actual action interval, enabling accurate and complete localization.

The three algorithms are extensively validated by experiments. Scheme one achieves detection performance of 64.68% on THUMOS14 and 42.94% on ActivityNet 1.2. Scheme two achieves 66.23% and 41.43%, which is 1.51% lower than scheme one on ActivityNet 1.2. Scheme three achieves 69.1% and 42.0%, higher than scheme two but lower than scheme one on ActivityNet 1.2. The experimental results demonstrate the effectiveness of the proposed algorithms, and comparisons with state-of-the-art methods confirm their superiority.
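The overlap-stripping step of scheme one can be illustrated with a minimal sketch. The interval representation and the split-at-midpoint rule below are illustrative assumptions; in the thesis the stripping is driven by learned adversarial boundary regression, not a fixed rule.

```python
def strip_overlaps(proposals):
    """Given per-clip action proposals as (start, end) pairs, remove the
    overlap between adjacent proposals by splitting the shared region at
    its midpoint, so no frame is claimed by two proposals."""
    out = []
    for start, end in sorted(proposals):
        if out and start < out[-1][1]:        # overlaps the previous proposal
            mid = (start + out[-1][1]) / 2.0  # split the shared region
            out[-1] = (out[-1][0], mid)
            start = mid
        out.append((start, end))
    return out

# Proposals from two adjacent clips share frames 4-6; after stripping,
# the boundary is resolved and the intervals become mutually exclusive.
resolved = strip_overlaps([(0, 6), (4, 10)])
```

This mutual exclusivity is what the complementary branch then checks: the stripped proposals are spliced back into a reconstructed video whose similarity to the original penalizes over-aggressive trimming.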
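The top-k aggregation of scheme two follows standard multiple-instance pooling of clip-level scores into video-level scores; a minimal numpy sketch, where the (T, C) score shape, the mean-of-top-k rule, and the placement of the auxiliary context class as an extra column are assumptions for illustration:

```python
import numpy as np

def topk_video_scores(clip_scores, k):
    """Aggregate per-clip class scores of shape (T, C) into one video-level
    score per class by averaging each class's k highest clip scores."""
    top = np.sort(clip_scores, axis=0)[-k:]  # k largest scores per class
    return top.mean(axis=0)

# Toy example: 4 clips, 2 action classes plus an auxiliary context class
# (last column); clips scoring high on context can be suppressed when
# localizing, stripping redundant context frames.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.1],
                   [0.7, 0.3, 0.9],
                   [0.1, 0.6, 0.8]])
video_level = topk_video_scores(scores, k=2)
```

Training the context column against video-level labels gives the model an explicit target for "near the action but not the action", which is the latent action/context difference the abstract refers to.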