Scene understanding is both a difficult problem and the ultimate objective of intelligent visual surveillance based on visual computing, and it is of considerable research significance. Traditional intelligent visual surveillance has mainly studied object detection and classification, object tracking, object matching, and object recognition. With the development of computer vision technology, there is an urgent need to acquire scene semantic information directly from natural surveillance videos. Video scene understanding technology, which builds on these traditional technologies, is therefore an active area of research.

In this thesis, we take the current urgent demand for crowd management in social life as the application background. Considering also the requirement of full-time continuous monitoring in practical applications, we study crowd surveillance oriented intelligent video scene understanding technology. We mainly address three problems: how to remove night from a night or dusk scene to meet the requirement of full-time scene understanding under low-light conditions; how to obtain "how many" information about a complicated crowd with high precision and high speed; and how to obtain "abnormal event" and "where" information from a complicated crowd with high precision and high speed. The main contributions of this thesis are as follows:

(1) For night removal in full-time crowd surveillance, we propose a color image based night removal algorithm. Traditional work on full-time surveillance was based on infrared images from an infrared camera; removing the night from an ordinary color image has rarely been attempted. To solve this difficult problem, we first formally model the color transform from night time to day time with a color estimation model. Based on this model, we present a method that fuses color estimation and sparse representation for night removal from a single image in a specific scene.
Meanwhile, we also present a method that fuses color estimation and edge enhancement for night removal from a single image in a general scene. The experimental results show that both the subjective visual effect after night removal and the objective image quality evaluation indices improve significantly.

(2) For crowd counting in a complicated crowd scene, we propose a situation-driven method that fuses depth and color information. Most previous works use only RGB (color) information, and thus do not work well in complex situations (e.g., heavy occlusion or changing illumination). To solve this difficult problem, we use RGB-D information. Based on adaptive camera view recognition, we propose a scalable template matching based method for squint-view crowd counting and a scene-adaptive method for top-view crowd counting. The experimental results demonstrate that our method clearly improves counting precision over current state-of-the-art approaches while running in real time.

(3) For crowd scene understanding, we propose a weighted classification method based on statistical learning to obtain "where" semantic information. Meanwhile, we propose a method for obtaining "abnormal event" semantic information about the crowd, based on trajectory clustering of start points and end points together with crowd motion pattern matching. Work on place understanding is currently rare. To solve this difficult problem, we first cluster the scenes by learning from labeled samples. Then, we classify the scene places according to weighted priors obtained by statistical learning. To recognize abnormal events effectively across multiple scenes, we first extract the crowd motion pattern based on optical flow and trajectory clustering. Then, for common, regularly occurring abnormal events, we match it against prior motion patterns.
Otherwise, we match it against the self-learned motion pattern of the current scene. The experimental results show that our method can effectively understand the "where" and "abnormal event" information, and that the obtained semantic information, such as "where", "what", and "abnormal event", can be converted to text for output.

(4) We design and implement a crowd surveillance oriented intelligent video scene understanding prototype system, which serves as a verification system for evaluating the proposed algorithms. Our goal is to implement a crowd surveillance system for large-scale public places. Extensive experiments on the system show that our algorithms can effectively obtain semantic information from surveillance scenes. The obtained information can be directly applied to semantic based event retrieval from massive amounts of collected video. The system can be widely applied in smart cities, smart prisons, intelligent traffic, and similar settings.