| Video action recognition and localization underpin video behavior analysis; they have substantial practical value in everyday applications and are an important research direction in computer vision. Unlike many other computer vision tasks, localizing and recognizing actions in video requires modeling both temporal and spatial information, so it is difficult to handle action detection with ordinary two-dimensional convolution alone. Traditional approaches to video action localization and recognition rely mostly on two-dimensional convolutional neural networks and hand-crafted features, which makes them complex to implement. This thesis aims to develop a more efficient and accurate action detection algorithm by studying the temporal signals of video, and designs a two-stage temporal action proposal network for action detection and localization. The main contributions are as follows. First, a modified watershed algorithm from the image segmentation field is applied to a 1D temporal signal to form proposals of varying length, which give a coarse localization of actions in the first stage of the detection network. To improve localization accuracy, a two-stream network that fuses temporal and spatial features is used as a binary action classifier to generate the temporal signal for the watershed grouping algorithm. We propose a Correctness Discriminator that recovers proposals the watershed proposal algorithm may miss, improving its accuracy. We then propose a Prior-Minor ranking algorithm, which turns the watershed algorithm into an improved Prior-Minor Watershed Action Proposal (PMWAP) algorithm, balances the advantages of the watershed and sliding-window algorithms, and locates action proposals more accurately in extreme cases. We further introduce a Context Information Module and, building on it, a Temporal Pyramid Pooling method to model the structure of internal action instances and their extended context, generating an enhanced global feature. The Context Information Module extends the starting and ending features of action segments and ensures the completeness of each action proposal. Temporal Pyramid Pooling models both the internal region and the extended region of a proposal, so that video features are processed in finer detail and action localization becomes more accurate. The second stage of our action detection algorithm, TCR, uses a multi-task learning mechanism to perform action localization and recognition simultaneously. The classification algorithm labels each proposal as action or background, filtering out a large number of redundant proposals. The regression algorithm refines the localization of the proposals to obtain more accurate action boundaries. TCR is trained on unit-level features generated by the C3D network rather than on single frames, which improves both accuracy and training efficiency. Finally, we propose a newly designed action classifier that integrates temporal intersection over union. By using a joint loss function, it overcomes the problem that a traditional action classifier may assign a high score to an inaccurately localized proposal, which degrades localization. The new classifier can further improve the accuracy of key action recognition in some special situations. In experiments on two large-scale benchmarks, THUMOS2014 and ActivityNet, our approach PMWAP+TCR outperforms other state-of-the-art systems, indicating that the method effectively improves the precision of action localization. The proposed network structure and experimental methods also offer a useful reference for video action localization and recognition. |
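
The watershed-style grouping of a 1D temporal signal can be illustrated with a minimal sketch. It assumes one actionness score in [0, 1] per video snippet (standing in for the output of the binary action classifier) and "floods" the signal at several thresholds; each maximal run of snippets at or above a threshold becomes a candidate proposal. The function names and threshold values are hypothetical, and the Correctness Discriminator and Prior-Minor ranking are not modeled, so this should be read as an illustration of the general idea rather than the thesis's algorithm.

```python
from typing import List, Sequence, Tuple

Proposal = Tuple[int, int]  # [start, end) snippet indices


def runs_above(scores: Sequence[float], level: float) -> List[Proposal]:
    """Return maximal runs of consecutive snippets whose score is >= level."""
    runs: List[Proposal] = []
    start = None
    for i, score in enumerate(scores):
        if score >= level:
            if start is None:
                start = i  # a new basin opens here
        elif start is not None:
            runs.append((start, i))  # the basin closes before snippet i
            start = None
    if start is not None:
        runs.append((start, len(scores)))  # basin reaches the end of the video
    return runs


def watershed_proposals(
    scores: Sequence[float],
    levels: Sequence[float] = (0.3, 0.5, 0.7),  # assumed thresholds
) -> List[Proposal]:
    """Collect the distinct runs found at every threshold level."""
    proposals = set()
    for level in levels:
        proposals.update(runs_above(scores, level))
    return sorted(proposals)


if __name__ == "__main__":
    actionness = [0.1, 0.6, 0.8, 0.4, 0.1, 0.9, 0.9, 0.2]
    for start, end in watershed_proposals(actionness):
        print(f"proposal: snippets {start}..{end - 1}")
```

Lower thresholds yield longer, more permissive proposals and higher thresholds yield shorter, more confident ones, which is how a single signal produces proposals of different lengths.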
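
Temporal intersection over union (tIoU), which the final classifier integrates, measures how well a proposed segment overlaps a ground-truth segment on the time axis. The sketch below assumes segments are `(start, end)` pairs in seconds; the function name is hypothetical, and the joint loss function is not reproduced because the abstract does not specify its form.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(a: Segment, b: Segment) -> float:
    """Return the temporal intersection over union of two segments."""
    # Overlap length is zero when the segments do not intersect.
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    ground_truth = (0.0, 10.0)
    proposal = (5.0, 15.0)
    print(f"tIoU = {temporal_iou(ground_truth, proposal):.3f}")
```

A proposal that a classifier scores highly but whose tIoU with the ground truth is low is exactly the inaccurate-localization case that the tIoU-aware classifier is meant to penalize.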