Autonomous driving has attracted considerable attention in recent years, and many of its core technologies have made significant progress thanks to the development of deep learning. An autonomous driving system is mainly composed of three parts: perception, planning, and control. The perception module serves as the eyes of the system; it is responsible for perceiving the surrounding environment and providing accurate scene information to the planning and control modules. Perception in autonomous driving scenarios is a challenging task because the working environments of autonomous vehicles are extremely complex. This paper focuses on monocular object perception in autonomous driving scenarios and makes two contributions.

1. An anchor-based single-stage monocular 3D object detector. The goal of 3D object detection is to identify the categories of surrounding obstacles and their 3D bounding boxes, which are parameterized by size, location, and orientation. This paper proposes an anchor-based single-stage neural network with feature alignment and asymmetric non-local attention for monocular 3D object detection. A two-step feature alignment method is proposed to address the mismatch between the receptive field of a feature and its anchor in single-stage object detectors. In addition, an asymmetric non-local attention block is proposed to combine environmental information with depth-wise feature extraction, which improves the accuracy of object depth estimation. Experimental results on the KITTI dataset show that the proposed method significantly improves performance on both the 3D object detection and bird's eye view tasks.

2. A monocular object referral method based on cross-modal transformers. Object referral in the autonomous driving setting considers the situation where a passenger gives a command that may be associated with an object in a street scene, such as "drive up to the right side of that truck in front of us." The goal of object referral is to retrieve the corresponding object in the scene according to the natural language command. This paper proposes a framework using cross-modal transformers to tackle the joint understanding of vision and language. A convolutional neural network is adopted for visual feature extraction, and the transformer encoder is used to learn linguistic features. Linguistic and visual features are matched and aggregated through cross-modal attention in the transformer decoder to learn cross-modal representations. Experimental results on the Talk2Car dataset demonstrate that the proposed method outperforms previous methods by a wide margin.
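To make the first contribution more concrete, the sketch below shows one plausible PyTorch form of an asymmetric non-local attention block, in which the keys and values are spatially pooled while the queries keep full resolution, so every location attends to a compressed summary of the whole scene. The class name, channel reduction, and pooling size are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class AsymmetricNonLocalAttention(nn.Module):
    """Hypothetical sketch: non-local attention with spatially pooled keys/values."""
    def __init__(self, in_channels, reduced_channels=None, pool_size=4):
        super().__init__()
        reduced_channels = reduced_channels or in_channels // 2
        self.query = nn.Conv2d(in_channels, reduced_channels, 1)
        self.key = nn.Conv2d(in_channels, reduced_channels, 1)
        self.value = nn.Conv2d(in_channels, reduced_channels, 1)
        self.out = nn.Conv2d(reduced_channels, in_channels, 1)
        # Pooling the key/value branch makes the attention asymmetric: full-resolution
        # queries attend to a small fixed grid that summarizes global context.
        self.pool = nn.AdaptiveAvgPool2d(pool_size)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)              # (B, H*W, C')
        k = self.key(self.pool(x)).flatten(2)                     # (B, C', S*S)
        v = self.value(self.pool(x)).flatten(2).transpose(1, 2)   # (B, S*S, C')
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1) # (B, H*W, S*S)
        ctx = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)     # (B, C', H, W)
        # Residual connection: keep the original features and add global context.
        return x + self.out(ctx)
```

As a usage example, `AsymmetricNonLocalAttention(256)(torch.randn(1, 256, 48, 160))` returns a tensor of the same shape with global context mixed into every location; pooling the key/value branch keeps the attention cost low even on the wide feature maps typical of driving scenes.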
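For the second contribution, the sketch below illustrates one possible cross-modal transformer layout under the abstract's description: a CNN extracts visual features, a transformer encoder learns linguistic features, and a transformer decoder fuses the two through cross-attention. The backbone choice, which modality acts as queries, and the per-location scoring head are assumptions for illustration, not the authors' exact architecture; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class CrossModalGrounding(nn.Module):
    """Hypothetical sketch: ground a language command in image features with
    a transformer encoder (text) and decoder (cross-modal attention)."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        # CNN backbone for visual feature extraction (ResNet-18 chosen arbitrarily).
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        self.vis_proj = nn.Conv2d(512, d_model, 1)
        # Transformer encoder learns linguistic features from embedded command tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Transformer decoder: visual tokens attend to linguistic features
        # (which modality queries the other is an assumption here).
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.score = nn.Linear(d_model, 1)  # matching score per visual location

    def forward(self, image, tokens):
        vis = self.vis_proj(self.cnn(image))           # (B, D, h, w)
        vis = vis.flatten(2).transpose(1, 2)           # (B, h*w, D)
        lang = self.text_encoder(self.embed(tokens))   # (B, T, D)
        fused = self.decoder(tgt=vis, memory=lang)     # cross-modal attention
        return self.score(fused).squeeze(-1)           # (B, h*w) grounding logits
```

In a full object-referral pipeline such as one evaluated on Talk2Car, these per-location logits would typically be matched against region proposals or a ground-truth box to retrieve the referred object; that step is outside the scope of this sketch.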