Images and videos captured in urban environments usually contain both dynamic and static areas. Dynamic areas include moving objects such as pedestrians and vehicles, while static areas include fixed facilities such as buildings and roads. Dynamic-to-static translation converts dynamic images or videos into static ones, that is, it eliminates the dynamic content and restores the corresponding static background. This significantly enhances the feature matching of landmarks in dynamic environments and plays an important role in the visual localization and navigation of unmanned vehicles. With the development of deep generative technology, existing methods usually employ conditional generative adversarial networks to learn the mapping directly. However, simply treating scene translation as image translation makes inefficient use of spatiotemporal features, which easily leads to blurring artifacts in the synthesized static content. This paper therefore conducts in-depth research on dynamic-to-static translation and verifies the effectiveness of the improved models in extensive experiments. The main contributions of this paper are as follows:

(1) A coarse-to-fine translation model is proposed to address the loss of original detail and the low quality of reconstructed static content in synthesized static images. It recasts dynamic-to-static translation as an image inpainting problem through a coarse network, a shadow detection module, and a refinement network, and introduces a novel texture-structure attention to make full use of spatial image information (a minimal sketch of this pipeline follows the abstract). A dataset is constructed with the CARLA simulator, and image quality and visual place recognition evaluations are performed, respectively. Experiments show that the proposed model greatly outperforms existing state-of-the-art models on these performance indicators. In addition, the proposed model is transferred to real-world scenarios, further validating its generalization performance.

(2) A multi-modal translation model is proposed to better exploit the semantic and image information of dynamic scenes and to extract highly robust visual localization features. It infers static images and static semantic segmentation with a dynamic-to-static semantic segmentation network, a semantic prior probability model, and a static image generation network (see the second sketch below). Dynamic-to-static semantic segmentation is introduced into the dynamic-to-static translation task for the first time, and on this basis an image-plus-semantics multi-modal encoding and image retrieval approach is developed. The dataset is constructed with the CARLA simulator, and semantic segmentation, image quality, and visual place recognition evaluations are performed to verify the robustness and usefulness of the proposed model.

(3) A video translation model is proposed to fully exploit the spatiotemporal features of video sequences and generate spatiotemporally consistent static videos. It adopts a two-stage coarse-to-fine design and treats dynamic-to-static video translation as a video inpainting problem: rough static images and complete dynamic-region masks are generated by the coarse network together with optical-flow-weighted masks, and video spatiotemporal information is then efficiently extracted by a refinement network embedded with a temporal shift module and feature alignment enhancement (see the third sketch below). Based on the driverless simulation dataset, video quality and visual odometry evaluations are carried out, which demonstrate the advantages of the proposed model in maintaining the spatiotemporal consistency of synthesized videos
and in visual localization.
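As a concrete illustration of the coarse-to-fine pipeline in contribution (1), the following PyTorch sketch wires a coarse network, a shadow detection head, and an attention-guided refinement stage together. All module names, layer sizes, and the particular form of the texture-structure attention are illustrative assumptions; the abstract does not specify the actual architectures.

```python
# Minimal sketch of the coarse-to-fine pipeline; all modules are stand-ins.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                  nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class TextureStructureAttention(nn.Module):
    """Hypothetical reading of 'texture-structure attention': texture features
    are spatially reweighted by a gate computed from structure features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
    def forward(self, texture_feat, structure_feat):
        attn = self.gate(structure_feat)           # B x 1 x H x W
        return texture_feat * attn + texture_feat  # residual reweighting

class CoarseToFineTranslator(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        self.coarse = nn.Sequential(ConvBlock(3, c), nn.Conv2d(c, 3, 3, padding=1))
        self.shadow_head = nn.Sequential(ConvBlock(3, c),
                                         nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())
        self.texture_enc = ConvBlock(3, c)
        self.structure_enc = ConvBlock(3, c)
        self.attn = TextureStructureAttention(c)
        self.refine = nn.Conv2d(c, 3, 3, padding=1)

    def forward(self, dynamic_img, dyn_mask):
        # 1) Coarse static prediction over the whole frame.
        coarse = self.coarse(dynamic_img)
        # 2) Extend the dynamic mask with detected shadows, turning the
        #    task into inpainting of object + shadow regions.
        shadow = self.shadow_head(dynamic_img)
        full_mask = torch.clamp(dyn_mask + shadow, 0, 1)
        # 3) Keep static pixels, fill masked pixels with the coarse result.
        filled = dynamic_img * (1 - full_mask) + coarse * full_mask
        # 4) Refinement guided by texture-structure attention.
        feat = self.attn(self.texture_enc(filled), self.structure_enc(filled))
        return self.refine(feat), full_mask

img = torch.rand(1, 3, 128, 256)    # dynamic frame
mask = torch.zeros(1, 1, 128, 256)  # known dynamic-object mask
static, mask_out = CoarseToFineTranslator()(img, mask)
```

The point of the sketch is the data flow rather than the layers: the shadow head enlarges the dynamic mask so that refinement inpaints both moving objects and the shadows they cast.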
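The multi-modal model in contribution (2) can be read as the following inference flow. The networks below are stand-in convolution stacks, the class count and the vehicle/pedestrian class ids are hypothetical, and fusing the prior by adding log-priors to the class logits before the softmax is only one plausible realization of the "semantic prior probability model".

```python
# Hedged sketch of the multi-modal inference flow described in (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 13  # assumption: CARLA-style semantic classes

seg_net = nn.Conv2d(3, NUM_CLASSES, 3, padding=1)      # dynamic-to-static segmentation
gen_net = nn.Conv2d(3 + NUM_CLASSES, 3, 3, padding=1)  # static image generator
img_enc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                        nn.Linear(3 * 8 * 8, 64))      # toy image embedding

# Semantic prior: per-class probability of appearing in a static scene,
# e.g. high for building/road, near zero for vehicle/pedestrian.
static_prior = torch.ones(NUM_CLASSES)
static_prior[[4, 10]] = 1e-3  # hypothetical vehicle/pedestrian class ids

def translate_and_encode(dynamic_img):
    logits = seg_net(dynamic_img)
    # Reweight class scores with the static prior before normalizing.
    probs = F.softmax(logits + static_prior.log().view(1, -1, 1, 1), dim=1)
    # Generate the static image conditioned on image + static semantics.
    static_img = gen_net(torch.cat([dynamic_img, probs], dim=1))
    # Multi-modal place descriptor: image embedding + semantic histogram.
    desc = torch.cat([img_enc(static_img), probs.mean(dim=(2, 3))], dim=1)
    return static_img, probs, desc

img = torch.rand(1, 3, 128, 256)
static_img, static_sem, descriptor = translate_and_encode(img)
# Retrieval would then compare descriptors (e.g. cosine similarity)
# against a database of map images.
```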
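For contribution (3), the two reusable ideas are the temporal shift module (a known technique from TSM, Lin et al., ICCV 2019) and flow-weighted mask completion. The sketch below implements a standard temporal shift and a threshold-based flow weighting; the thresholding scheme is an assumption, not necessarily the thesis method.

```python
# Sketch of the video-stage building blocks; tensor layout is B x T x C x H x W.
import torch

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels forward and 1/fold_div backward in
    time, letting per-frame 2D convs exchange temporal information for free."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # rest untouched
    return out

def flow_weighted_mask(coarse_mask, flow_mag, thresh=1.0):
    """Complete the dynamic-region mask by adding pixels whose optical-flow
    magnitude exceeds a threshold (one plausible 'flow-weighted' scheme)."""
    moving = (flow_mag > thresh).float()
    return torch.clamp(coarse_mask + moving, 0, 1)

frames_feat = torch.rand(1, 5, 32, 64, 128)   # per-frame features, B,T,C,H,W
shifted = temporal_shift(frames_feat)         # feed to per-frame refinement convs
coarse_mask = torch.zeros(1, 1, 64, 128)
full_mask = flow_weighted_mask(coarse_mask, torch.rand(1, 1, 64, 128) * 3)
```

Because the shift itself has no parameters, embedding it in the refinement network adds temporal reasoning at essentially the cost of a 2D model, which matches the abstract's emphasis on efficient spatiotemporal feature extraction.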