Research On Key Technologies Of Video Group Activity Analysis And Recognition

Posted on: 2021-04-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Z Xu
Full Text: PDF
GTID: 1488306470467924
Subject: Electronic Science and Technology
Abstract/Summary:
Video analysis is the semantic understanding of video content. It is a comprehensive task with a wide range of applications in video surveillance, event analysis, intelligent video retrieval, and human-computer interaction. Group behavior analysis is important for understanding video content effectively and has become a research hotspot in video analysis in recent years. Group behavior is a combined expression of individual behavior and the interactions between individuals, so the representation of individual characteristics is the basis of group behavior analysis. This thesis focuses on the task of group behavior recognition and conducts research in three areas: object detection, object tracking, and group activity recognition. The specific content is as follows.

First, anchor-based object detection algorithms suffer from high computational complexity and poor real-time performance, and discrete anchors cannot cover the object area over a continuous range of scales. Some researchers detect objects at multiple scales by allocating multiple discrete anchors, but because discrete anchors cannot cover all objects in a continuous scale range, performance is unstable. Others have introduced deeper and wider CNNs and dense anchor sampling strategies, which improve performance but also increase memory requirements and reduce speed. In fact, research has shown that each feature point in a CNN output feature map can be mapped directly back to the original input image, where it corresponds to a receptive field (RF). In addition, the RFs of the output feature maps at different network levels each cover a specific size range. Neurons with the same RF can predict objects over a continuous range of scales rather than at discrete scales, so the RF can be regarded as a natural anchor. Based on this, the thesis proposes a fast category-specific object detection method, OS-LFD (Light and Fast Detector with Ommateum Structure), built on receptive field anchors and an ommateum structure. By analyzing the correlation between the effective receptive field (ERF) and the scale of the object, a 4-branch network is designed to cover objects of continuous scale. Further, an ommateum module with a similar structure and shared parameters is designed on each network branch, which effectively reduces the number of parameters. Experimental results show that OS-LFD achieves higher detection accuracy with a smaller model and faster speed, balancing detection accuracy and running speed well.

Second, an object tracking algorithm based on a spatial-temporal structure-aware refinement network (STSAR-Net) is proposed. The network uses the GRU sequence model to learn the dependencies between the internal structures of objects. It discriminates well against similar distractor targets and adds almost no parameters. In addition, a spatial-temporal refinement layer based on an LSTM regression model is designed to jointly infer the target's historical information and refine the tracking results over an extended spatial-temporal range, which effectively alleviates target loss caused by occlusion and deformation. Comparative experimental results show that the proposed method achieves the best performance among the compared methods.

Finally, a group activity recognition algorithm based on spatial-temporal attention and multi-feature relation representation is proposed. First, an object relation module is introduced, which processes all objects in a scene simultaneously through an interaction between their appearance features and geometry, allowing their relations to be modeled. Second, to extract effective motion features, an optical flow network is fine-tuned using the action loss as the supervisory signal. Then, two types of inference models, opt-GRU and relation-GRU, are proposed to encode the object relationships and motion representation effectively and to form a discriminative frame-level feature representation. Finally, an attention-based temporal aggregation layer is proposed to integrate frame-level features with different weights and form an effective video-level representation. Experimental results show that each module effectively improves the performance of group activity recognition.
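
The attention-based temporal aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the thesis computes its attention weights, so everything here is an assumption: the function names `softmax` and `attention_pool` are hypothetical, and `score_weights` is a hypothetical stand-in for a learned scoring vector. The sketch scores each frame-level feature, normalizes the scores with a softmax, and forms the video-level feature as the weighted sum.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    frame_features: Sequence[Sequence[float]],
    score_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Aggregate frame-level features into one video-level feature.

    score_weights is a hypothetical stand-in for a learned scoring
    vector; the thesis abstract does not describe its actual form.
    """
    # One scalar relevance score per frame (dot product with the scoring vector).
    scores = [
        sum(w * x for w, x in zip(score_weights, feat))
        for feat in frame_features
    ]
    # Attention weights are non-negative and sum to 1 across frames.
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Video-level feature: attention-weighted sum of the frame features.
    video_feature = [
        sum(a * feat[d] for a, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return video_feature, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy frame-level features
    video, attn = attention_pool(frames, score_weights=[2.0, 0.0])
    print("attention weights:", attn)
    print("video-level feature:", video)
```

With an all-zero scoring vector the weights become uniform and the result reduces to plain average pooling, which makes it easy to compare the two aggregation schemes.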
Keywords/Search Tags: object detection, object tracking, group activity recognition, deep learning, video analysis, Ommateum structure, receptive field anchor, spatial-temporal aware structure, spatial-temporal attention, relational expression, multi-feature