| Temporal action detection is an important research task in computer vision. Its goal is to locate the start and end times of actions in untrimmed videos and to classify those actions. Temporal action detection techniques are widely used in sports broadcasting, security surveillance, video-based teaching, and other fields. In recent years, supervised methods have achieved remarkable results on this task. However, these methods usually require a large amount of labeled data, and the high cost of annotation limits the development and application of temporal action detection algorithms. Temporal action detection comprises two subtasks: temporal action proposal generation (localization) and action recognition (classification). Localization is the key step, and its accuracy determines the performance of the overall detection pipeline, so improving the accuracy of temporal action localization is important for improving temporal action detection. To reduce the dependence on labeled data, this thesis proposes a Self-Supervised Pretraining Transformer (SSPT) method that can train an action localizer without any labeled data. SSPT designs a pretext task named "Random Query Segment Detection" as the learning target for pretraining the Transformer. To make full use of contextual semantic information and improve the accuracy of temporal localization, SSPT also uses the Transformer structure to model temporal context and capture global temporal dependencies. Experimental results on the THUMOS14 dataset show that SSPT achieves the best localization performance among the compared algorithms. Comparing image object detection with temporal action detection, the thesis finds that temporal action boundaries in continuous video are blurred and that the difference between action and background is small. To address the inaccurate localization caused by these blurred boundaries, the thesis further proposes SSPT-Tr, a triplet-based feature reconstruction method. SSPT-Tr uses a triplet loss to constrain the boundary features between different actions and between actions and background, which makes action and background features easier to distinguish and thereby improves the accuracy of temporal action localization. Compared with the baseline method, SSPT-Tr greatly reduces the training time of the downstream task. Experimental results show that SSPT-Tr outperforms SSPT in localization performance. To improve the efficiency of temporal action detection, the thesis designs an End-to-End Action Localization algorithm. Built on the self-supervised pretrained SSPT action localization method, it merges the temporal action proposal generation task and the action recognition task into a simple and efficient network framework. In addition, because adjacent temporal locations have highly similar features, action boundaries are hard to distinguish; the thesis therefore proposes a boundary enhancement module that makes action boundaries more distinct in the feature representation, yielding more accurate temporal boundaries. Compared with current action detection methods, the end-to-end design of this method gives it an advantage in efficiency owing to its simple structure. |
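
The abstract names two ingredients that a short sketch can make concrete: the "Random Query Segment Detection" pretext task and the triplet loss used by SSPT-Tr. The NumPy code below is only an illustration under stated assumptions, not the thesis's implementation. The function names `sample_random_query` and `triplet_loss`, the mean pooling of the query segment, the `min_len` parameter, the Euclidean distance, and the margin value are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def sample_random_query(features, rng, min_len=4):
    """Sample one self-supervised training pair from an unlabeled video.

    features: array of shape (T, D), one feature vector per temporal location.
    Returns a pooled query vector and the normalized (start, end) of the
    segment it was cut from, which serves as the regression target.
    Assumption: the query is the mean of the segment's features.
    """
    num_steps = features.shape[0]
    # Segment length in [min_len, T]; the upper bound of integers() is exclusive.
    length = int(rng.integers(min_len, num_steps + 1))
    start = int(rng.integers(0, num_steps - length + 1))
    end = start + length
    query = features[start:end].mean(axis=0)
    target = np.array([start / num_steps, end / num_steps])
    return query, target


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with Euclidean distance (assumed metric).

    Pulls the anchor toward the positive (e.g. a feature from the same action)
    and pushes it away from the negative (e.g. a background feature).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, float(d_pos - d_neg + margin))


rng = np.random.default_rng(0)
video_features = rng.normal(size=(32, 8))  # stand-in for extracted clip features
query, target = sample_random_query(video_features, rng)
```

In this sketch no annotation is needed because the target boundaries come from the random crop itself, which is the property the abstract attributes to the pretext task. The triplet loss is zero once the negative is farther from the anchor than the positive by at least the margin.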