Visual place recognition (VPR) is a fundamental module that coarsely estimates where an image was taken by matching it against a set of previously visited locations, enabling mobile robots to identify their position for autonomous navigation in the real world. VPR supports both independent localization against a prior map and loop-closure detection in Simultaneous Localization and Mapping (SLAM). Because robotic capabilities are tightly coupled with the operating environment, the general aim of VPR systems is invariance to changes in viewpoint (different patterns observed from different perspectives) and changes in appearance (structural and illumination changes across the time of day, seasons, and weather). Solving these challenges is essential for robust VPR systems; given its theoretical and practical significance, VPR has been the focus of much research in the computer vision and robotics communities over the last decade. A key capability of VPR systems is recognizing previously seen places under varying conditions. However, the problem of appearance change becomes even more challenging when combined with viewpoint variation, which can degrade the performance of real-world robotic systems. Since existing methods that increase viewpoint invariance inevitably sacrifice some degree of appearance invariance, place recognition under severe appearance and viewpoint changes remains particularly difficult. This thesis focuses on recognizing places by requiring feature invariance to both viewpoint and appearance, conducting in-depth research on feature fusion, hybrid models, retrieval strategies, and training optimization with deep learning theory. First, a joint embedding method of visual, structural, and semantic features is proposed to fuse multiple features from multiple pre-trained models through
unsupervised learning, effectively alleviating the effects of appearance changes and viewpoint differences. A hybrid neural network model is then introduced to enable end-to-end learning of visual, structural, and semantic features; multi-scale feature fusion is performed on the hierarchical output of the hybrid model with weakly supervised learning to improve robustness under severe appearance changes and viewpoint differences. On the basis of this single-stage retrieval, an attention-guided two-stage hierarchical retrieval framework is proposed to reduce false matches via a geometric verification step based on locally weighted feature matching. To address the shortcomings of the fixed margin of the triplet loss and the coarse-grained supervision available under weakly supervised learning, an adaptive triplet loss and a fine-grained region-level supervised learning strategy are developed to avoid suboptimal convergence. As a result, the robustness and accuracy of visual place recognition are improved by fusing traditional machine learning and deep learning methods. A wheel-leg hybrid hexapod robot is developed, and a mobile robot localization system is designed, together with the corresponding image collection, image pre-processing, and image matching systems. Finally, the effectiveness of the three proposed visual place recognition methods is demonstrated on this experimental platform.

The main research works are as follows:

(1) A joint embedding method based on Convolutional Neural Network (CNN) features and a semantic topological graph. A pre-trained semantic segmentation model is used to extract image landmark regions and obtain the corresponding label information; semantic topological graphs are introduced to encode the spatial relationships of landmarks, and random walk descriptors are employed to characterize the topological graphs for graph embedding. Landmark semantic labels are used as masks to filter dynamic
landmarks, ensuring the dynamic invariance of the feature embedding. A pre-trained CNN is applied to learn the visual features of distinctive landmarks, and a feature fusion scheme is introduced to achieve the joint embedding of visual, structural, and semantic features. Experiments are conducted on public datasets and an engineering prototype: the maximum recall of the proposed method is 84.33% at 100% precision on the Gardens Point dataset, and the Recall@10 scores are 38.5% and 31.2% for the apartment and school building scenarios, respectively, which shows that the proposed method can effectively alleviate the effects of appearance variations and viewpoint differences.

(2) An end-to-end feature learning approach based on a hybrid CNN-Transformer framework, proposed to achieve end-to-end embedding of visual, structural, and semantic features. A hybrid neural network model is introduced to combine the complementary advantages of CNN and Transformer feature embedding. The network utilizes a CNN-based feature pyramid to obtain detailed visual understanding, while a vision Transformer models image contextual information and dynamically aggregates task-related features; the hierarchical output features of the hybrid model are then fused to encode multi-level geometric features. To acquire multi-scale semantic information, a global semantic NetVLAD aggregation strategy is constructed with the help of prior knowledge. A differentiable neural architecture consisting of the hybrid CNN-Transformer feature extraction network and a global semantic aggregation layer is trained end-to-end with weakly supervised learning. A weakly supervised optimization approach that fuses an adaptive triplet loss with fine-grained supervisory information is proposed. To address the high intra-class variation of positive sample pairs in the feature embedding space caused by the fixed margin of the triplet loss, an adaptive triplet loss with a dynamic margin is proposed based on
the statistics of all triplets. Extensive experiments are conducted on public datasets and an engineering prototype: the Recall@1 scores of the method on the Pitts30k-test, Pitts250k-test, and Tokyo 24/7 datasets are 86.3%, 87.5%, and 78.9%, respectively. In addition, the Recall@1 scores in the apartment and school building scenarios are 70.2% and 48.7%, respectively. This shows that the learned features are robust to appearance and viewpoint changes and achieve promising performance.

(3) A Transformer-guided two-stage hierarchical retrieval architecture, proposed for coarse-to-fine image matching with geometric verification for re-ranking. The global Transformer encoder of the hybrid CNN-Transformer feature extraction network is improved to a dual-level Transformer encoder that successively performs self-attention within local windows and over the global extent of the CNN feature map to obtain multi-scale spatial context, combining local interaction with global information. Moreover, a Transformer-guided geometric verification module is introduced that leverages the hierarchical Transformer's inherent self-attention mechanism to fuse multi-level attention, which is employed to filter the output tokens and obtain key patches. A weighted mutual nearest neighbor matching strategy is proposed to cross-match key local descriptors with their associated attention scores. A weakly supervised fine-tuning strategy that fuses fine-grained supervisory information is proposed: candidate regions are obtained from the patch descriptors learned with image-level supervision, and fine-grained labels are generated by matching the feature keypoints of the candidate regions. The fine-grained region-level supervision is exploited to further enhance the network's capability to learn locally discriminative features, effectively alleviating the confusion caused by weak image-level labels. Extensive experiments are conducted on public datasets and an engineering prototype: the Recall@1
scores of the proposed method on the MSLS-val, Nordland-test, Pitts30k-test, and Tokyo 24/7 datasets are 89.2%, 57.9%, 90.1%, and 88.2%, respectively. Besides, the Recall@1 scores in the apartment and school building scenarios are 84.5% and 77.5%, respectively. These results demonstrate the effectiveness of the proposed method.
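The random walk descriptor of contribution (1) can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the toy graph, the walk length, and the label-sequence histogram are all illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical semantic topological graph: each node is a landmark with a
# semantic label; edges connect spatially adjacent landmarks.
graph = {
    0: {"label": "building", "neighbors": [1, 2]},
    1: {"label": "tree",     "neighbors": [0, 2]},
    2: {"label": "sign",     "neighbors": [0, 1]},
}

def random_walk_descriptor(graph, num_walks=200, walk_len=3, seed=0):
    """Normalized histogram of semantic-label sequences visited by short
    random walks over the topological graph.

    Two graphs whose landmarks share labels and spatial relations yield
    similar histograms, giving a viewpoint-tolerant structural descriptor.
    """
    rng = random.Random(seed)
    counts = Counter()
    nodes = list(graph)
    for _ in range(num_walks):
        node = rng.choice(nodes)
        labels = [graph[node]["label"]]
        for _ in range(walk_len - 1):
            node = rng.choice(graph[node]["neighbors"])
            labels.append(graph[node]["label"])
        counts["->".join(labels)] += 1
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}

desc = random_walk_descriptor(graph)
```

Two such descriptors could then be compared with any histogram distance (e.g. intersection or chi-squared) before fusion with the visual features.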
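The adaptive triplet loss of contribution (2) replaces the fixed margin with one derived from the statistics of all triplets. The sketch below is one plausible instantiation under stated assumptions: the margin is set from the mean and standard deviation of the in-batch positive-negative distance gaps, which may differ from the statistic the thesis actually uses.

```python
import numpy as np

def adaptive_triplet_loss(anchors, positives, negatives, beta=0.5):
    """Triplet hinge loss whose margin adapts to batch statistics.

    The margin (mean of the negative-positive distance gaps plus beta
    times their standard deviation) is an illustrative assumption.
    """
    d_pos = np.linalg.norm(anchors - positives, axis=1)  # anchor-positive distances
    d_neg = np.linalg.norm(anchors - negatives, axis=1)  # anchor-negative distances
    gaps = d_neg - d_pos
    margin = max(gaps.mean() + beta * gaps.std(), 0.0)   # dynamic margin
    losses = np.maximum(d_pos - d_neg + margin, 0.0)     # hinge per triplet
    return losses.mean(), margin

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 128))                 # anchor embeddings
p = a + 0.1 * rng.normal(size=(8, 128))       # positives: near their anchors
n = rng.normal(size=(8, 128))                 # negatives: unrelated embeddings
loss, margin = adaptive_triplet_loss(a, p, n)
```

Because the margin tracks the batch, easy batches do not collapse the positive pairs toward zero loss, which is the intra-class-variation problem the fixed margin causes.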
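The weighted mutual nearest neighbor matching of contribution (3) can be sketched as follows. The product weighting of the two attention scores and the toy descriptors are illustrative assumptions; only the mutual-nearest-neighbor constraint itself is taken from the text.

```python
import numpy as np

def weighted_mutual_nn_score(desc_q, desc_r, attn_q, attn_r):
    """Re-ranking score from weighted mutual nearest neighbor matching.

    Key-patch descriptors are cross-matched; a pair (i, j) contributes
    only if i is j's nearest neighbor and vice versa, weighted by the two
    patches' attention scores (product weighting is an assumption here).
    """
    sim = desc_q @ desc_r.T                  # cosine similarity (unit-norm rows)
    nn_q = sim.argmax(axis=1)                # best reference patch per query patch
    nn_r = sim.argmax(axis=0)                # best query patch per reference patch
    score = 0.0
    for i, j in enumerate(nn_q):
        if nn_r[j] == i:                     # mutual nearest neighbors only
            score += attn_q[i] * attn_r[j] * sim[i, j]
    return score

rng = np.random.default_rng(1)
q = rng.normal(size=(16, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)     # L2-normalize descriptors
r_same = q + 0.05 * rng.normal(size=(16, 64))     # same place, slight change
r_same /= np.linalg.norm(r_same, axis=1, keepdims=True)
r_diff = rng.normal(size=(16, 64))                # different place
r_diff /= np.linalg.norm(r_diff, axis=1, keepdims=True)
attn = np.ones(16)                                # uniform attention for the demo
s_same = weighted_mutual_nn_score(q, r_same, attn, attn)
s_diff = weighted_mutual_nn_score(q, r_diff, attn, attn)
```

In the two-stage pipeline this score would re-rank the top candidates returned by the global descriptor, rejecting geometrically inconsistent matches.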