Event understanding is an important research direction in intelligent video surveillance, with a wide range of applications such as security systems, criminal investigation, and public management. The traditional way to understand events in surveillance videos is manual inspection, which is slow and poorly automated and therefore cannot meet the requirements of large-scale intelligent video surveillance. Meanwhile, advances in artificial intelligence have made it possible for computers to understand the events in a video automatically and accurately. As a result, event understanding in surveillance videos has gradually become a popular research topic. This thesis studies event understanding in intelligent surveillance video from three aspects: real-time human action recognition, anomaly detection in surveillance videos, and event retrieval based on natural language queries. The main contributions of this thesis are as follows:

(1) A real-time action recognition method combining a three-dimensional convolutional neural network with the temporal segment algorithm. To handle the complex multi-level temporal structure of video, we use three-dimensional convolutional neural networks to capture single-frame visual features and model short-term information, and we propose the temporal segment algorithm to capture long-term features. The method does not need to extract computationally expensive features such as optical flow, and it samples the video sparsely in both the training and testing stages, which reduces the time complexity. In addition, we adopt pruning and quantization algorithms to compress the model, which facilitates deployment in real-world settings. Experimental results show that our method recognizes human actions in real time while also improving recognition accuracy.

(2) An anomaly detection method based on transfer learning. Because training samples in surveillance videos are extremely scarce, we transfer related common knowledge from large-scale action recognition datasets and use the semantic similarity between the action categories in action recognition and the event categories in event detection to perform event detection. We also observe a background-bias phenomenon: existing models do not learn the abnormal pattern and instead make their judgment based on the background information of the surveillance video. To solve this problem, we propose a regional loss function that guides the model to focus on the abnormal area. We further propose to use meta-learning to model the relationships among training samples, which enhances the generalization ability of the model. Experimental results show that our method improves the accuracy of anomaly detection.

(3) A sentence-guided multi-stage semantic fusion method for event retrieval. Natural language query-based event retrieval takes a natural language sentence as the query and locates the start and end times of the described event in the video. To address the complexity of the semantic information in sentences, in the feature extraction stage we design an early modulation module that modulates the visual feature extraction process and generates visual features containing rich semantic information. In the temporal localization stage, we design a late guidance module that uses sentence features to update the network feature map, which further integrates visual features and sentence features. Experimental results on public datasets show that our method improves the accuracy of event retrieval in videos.

(4) The design and implementation of a prototype system for event understanding in surveillance videos. To verify the effectiveness of the algorithms proposed in this thesis, we designed and implemented a video surveillance-oriented event understanding prototype system and used it to verify and analyze the proposed algorithms from multiple perspectives in real scenarios. The results confirm that the system can extract video features in real time, detect specific events, and retrieve events described in natural language. The system can be applied in many fields such as smart security, smart cities, and smart investigation.
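
The sparse sampling mentioned in contribution (1) can be illustrated with a minimal sketch. It assumes the common segment-based scheme in which the video is split into equal temporal segments and one short clip is taken from each; the abstract does not specify the exact sampling rule, so the function name `sample_segment_indices` and its parameters are hypothetical and are not the thesis implementation.

```python
import random
from typing import List, Optional


def sample_segment_indices(
    num_frames: int,
    num_segments: int,
    clip_len: int = 1,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick the start frame of one clip per temporal segment (illustrative only).

    The valid clip start positions [0, num_frames - clip_len] are divided into
    `num_segments` equal ranges. With `rng` given, one start is drawn at random
    from each range (training-style sampling); without it, the centre of each
    range is used (deterministic, testing-style sampling).
    """
    if num_segments <= 0 or clip_len <= 0:
        raise ValueError("num_segments and clip_len must be positive")
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")

    num_positions = num_frames - clip_len + 1  # count of valid clip starts
    starts = []
    for k in range(num_segments):
        lo = (k * num_positions) // num_segments
        hi = ((k + 1) * num_positions) // num_segments - 1
        hi = max(hi, lo)  # very short videos: segments may share a position
        if rng is None:
            starts.append((lo + hi) // 2)
        else:
            starts.append(rng.randint(lo, hi))  # randint bounds are inclusive
    return starts


if __name__ == "__main__":
    # A 300-frame video, 3 segments, 16-frame clips for a 3D CNN.
    print(sample_segment_indices(300, 3, clip_len=16))
```

Under these assumptions only `num_segments * clip_len` frames are decoded and passed to the network per video, regardless of the video length, which is where the reduction in computation comes from.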