| Temporal action localization in video has attracted growing attention from researchers in recent years. It aims to obtain the start and end times and the category of each action instance in untrimmed videos, and thus involves both detection and recognition. Because the backgrounds of videos from natural scenes are complex and, unlike images, video instances vary in length, temporal action localization is a very difficult task. This thesis comprehensively studies and summarizes research progress in temporal action localization and proposes two deep-learning-based methods, one fully supervised and one weakly supervised. To address the problem that fully supervised methods do not fully utilize features and have difficulty detecting actions of different durations, this thesis proposes an end-to-end action localization network based on graph convolution. First, an I3D-FPN backbone is established for feature extraction. Compared with the C3D network, it is deeper and has a larger contextual receptive field. Then, we design a module named LFF (Local Feature Fusion), which aggregates features from the baseline and uses two fully convolutional layers as a feature extractor. Finally, we present a module named TPGC (Two Pathway Graph Convolution), which dynamically obtains contextual information and high-level semantic information from neighbors through graph convolution and enhances the proposals’ awareness of context. To address the problem that background instances are treated as foreground and confuse training in weakly supervised methods, this thesis designs a two-stage action localization network that suppresses background and aligns features. In the first stage, to address the difficulty of modeling background in weakly supervised detection and to filter redundant information from the input features through attention weights, a weighted label is proposed, which
distinguishes the background label in the base and suppression branches. Because the proposals are not accurate at test time, we use Soft-NMS instead of NMS to obtain more accurate proposals. In the second stage, we propose Align Net, which aligns proposal sizes and mines background information. First, a 3D RoI-Align structure is proposed, which normalizes the size of the input proposals through fast trilinear interpolation and retains more detailed information. Then, we add useful information from the background to the foreground through a two-layer graph convolution to distinguish the background, and explore the relationship between the proposals and the background to improve accuracy. To verify the performance of the two methods, comparative experiments with related methods were conducted on two benchmark datasets, THUMOS’14 and ActivityNet. The results in this thesis surpass the baseline performance on both datasets. The fully supervised method surpasses G-TAL, the best network of 2019, on THUMOS’14 with an mAP@0.5 of 58.3. The weakly supervised method also achieves results comparable to the latest networks, demonstrating its robustness and effectiveness. |
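
The abstract mentions replacing NMS with Soft-NMS at test time. As a rough illustration only, the following is a minimal sketch of Gaussian Soft-NMS over temporal proposals. It is not the thesis implementation: the `(start, end, score)` tuple layout, the function names `temporal_iou` and `soft_nms`, and the values of `sigma` and `score_threshold` are all assumptions made for this example.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU of two segments; only the first two fields (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS over (start, end, score) tuples.

    Instead of discarding every proposal that overlaps the current best one
    (as hard NMS does), each overlapping score is decayed by
    exp(-iou^2 / sigma). sigma and score_threshold are illustrative defaults.
    """
    remaining = [list(p) for p in proposals]  # copy so the input is not mutated
    kept = []
    while remaining:
        # Select the highest-scoring proposal still in the pool.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][2])
        best = remaining.pop(best_idx)
        kept.append(tuple(best))
        survivors = []
        for p in remaining:
            iou = temporal_iou(best, p)
            p[2] *= math.exp(-(iou * iou) / sigma)  # decay by overlap
            if p[2] >= score_threshold:  # drop proposals whose score has collapsed
                survivors.append(p)
        remaining = survivors
    return kept


# Hypothetical proposals: two heavily overlapping segments and one isolated segment.
example = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
result = soft_nms(example)
```

In this sketch the overlapping second proposal is kept with a reduced score rather than removed outright, which is the behavioral difference from hard NMS that the abstract relies on.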