Research On Transformer-based Person Detection Algorithm For Overhead Fisheye Images

Posted on: 2024-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 2542307118950989
Subject: Information and Communication Engineering
Abstract/Summary:
Smart buildings, an important application of artificial intelligence, rely on sensors to perceive what is happening inside the building. Object detection techniques that monitor and sense the surrounding scene through cameras have been widely researched and applied. Person detection with smart cameras can support pedestrian flow analysis in shopping malls and intelligent office management. In emergencies such as fires or earthquakes, evacuation and rescue can be directed precisely according to the disaster situation and the number of people in each area. Conventional cameras generally use wide-angle lenses, which provide only a local view of the scene. Fisheye cameras with a 360° field of view provide a global view of the scene and are therefore widely used for intelligent scene monitoring.

However, applying object detection to people in fisheye images raises several difficulties. The radial geometry of fisheye images causes people to appear at arbitrary rotation angles, and barrel distortion makes it difficult to extract features of small objects. This thesis aims to strengthen the feature representation capability of the network model and to improve the accuracy of the fisheye person detector while retaining real-time detection speed. To address these issues, the main research work is divided into the following three parts.

First, to address the difficulty of extracting features of people in arbitrary orientations, this thesis improves the feature extraction capability of the backbone network. Visualizing feature-map activations shows that Swin Transformer extracts more distinct and focused features, with clearer contours, than convolutional neural networks. Because the rotated bounding box of a person in an arbitrary orientation is difficult to predict and current mainstream detection heads have poor real-time performance, this thesis adopts a detection head (Oriented Rep Points) that predicts orientation and position from a set of adaptive points, which capture the geometric orientation information of the object. Combining Swin Transformer with Oriented Rep Points yields a Transformer-based person detector for fisheye images that achieves 89.5% accuracy on the challenging fisheye dataset CEPDOF.

Second, to address the problem that a conventional Transformer cannot directly extract orientation features, this thesis introduces group-equivariant convolution to extract group-equivariant features in multiple orientations and proposes an aggregation module that aggregates these features to further enhance orientation information. Since group features in different orientations are correlated to different degrees, a group relation module based on global group self-attention is proposed to calculate the weight of each group. A Rotation-equivariant Transformer backbone network is then built by combining the group-equivariant convolution layer with window attention through the aggregation module and the equivariant group relation module. Compared with Swin Transformer, the Rotation-equivariant Transformer person detector improves accuracy by 0.3%, 0.5%, and 1.3% on the fisheye image datasets MWR, HABBOF, and CEPDOF, respectively.

Third, to address the problem that the large local receptive field of the Swin Transformer degrades small object detection, this thesis designs a multi-level receptive field structure that combines convolution and window attention to extract features, which enhances the ability of the network model to extract local features of small objects. On the CEPDOF dataset, the accuracy of small object detection is improved by 0.73%.
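The radial geometry mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification that is not taken from the thesis, that a person standing under an overhead fisheye camera appears aligned with the ray from the image centre to the person's position. The function names `radial_orientation_deg` and `radial_box` are hypothetical and are not part of the proposed detector.

```python
import math


def radial_orientation_deg(x, y, cx, cy):
    """Angle in degrees, in [0, 360), of the ray from the image centre
    (cx, cy) to the pixel (x, y).

    Image coordinates are assumed to have y pointing down, so the angle
    increases clockwise as seen on screen.
    """
    return math.degrees(math.atan2(y - cy, x - cx)) % 360.0


def radial_box(x, y, w, h, cx, cy):
    """Hypothetical rotated box (x, y, w, h, angle_deg) whose orientation
    follows the radial direction at (x, y). Illustration only."""
    return (x, y, w, h, radial_orientation_deg(x, y, cx, cy))


if __name__ == "__main__":
    # Two people at different positions in a 200x200 image centred at (100, 100)
    print(radial_box(200, 100, 20, 40, 100, 100))
    print(radial_box(100, 200, 20, 40, 100, 100))
```

Under this assumption the expected orientation changes continuously with position, which is why axis-aligned boxes fit poorly and a rotated-box detection head is used instead.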
Keywords/Search Tags: Overhead fisheye image, Person detection, Swin Transformer, Group-equivariant convolution, Feature aggregation