SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
Abstract
Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving. However, these maps have limitations: they lack detail, often contain inaccuracies, and are difficult to create and maintain, especially in an automated fashion. Can we use raw imagery to automatically create better maps that can be easily interpreted by both humans and machines? We introduce SNAP, a deep network that learns rich neural 2D maps from ground-level and overhead images. We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images. SNAP can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. Moreover, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. This enables effective pre-training for data-efficient semantic scene understanding, with the potential to unlock cost-efficient creation of more detailed maps.
Community
Proposes SNAP: given multi-modal images with camera poses, learn interpretable 2D (multi-modal) neural maps for visual positioning (and semantic mapping) in a self-supervised manner. Fuses ground-level and overhead modalities; assumes fully calibrated cameras (extrinsics and intrinsics) and an ortho-rectified top view.

Overhead encoder: a U-Net CNN. Feature maps are fused by max-pooling (between the ground-level perspective maps and the bird's-eye-view, BEV, maps).

Ground-level (perspective) encoder: takes an unordered set of images; a CNN gives pixel-wise features and depth (as scores over depth planes, i.e. an occupancy volume). Multi-view fusion of features: create horizontal planes along height (parallel to the ground plane); for each ground cell, project the vertical (height-wise) points through its center into nearby images; fetch the corresponding features (bilinear interpolation) and depth scores (trilinear interpolation); compute the mean and variance of the features; pass the mean and variance (with the maximum occupancy score) to an MLP that predicts the fused point feature; max-pool across points (height) to get the cell's neural map feature.

SSL setting: learn features that distinguish location. Given the relative poses of a query (single ground view) and a reference (aerial and multi-view ground imagery), define a score function that measures the alignment of the two neural maps (linearly projected/downsampled and L2-normalized) under an SE(2), 3-DoF transform. Minimize an InfoNCE (noise-contrastive estimation) loss, which raises the alignment score of the ground-truth pose and lowers it for incorrect poses. Negatives are mined through 2D-2D correspondences between neural map cells: sample correspondences according to feature similarity and solve for the pose with the Kabsch algorithm (Umeyama/Procrustes alignment), so negative samples become harder as learning progresses. At inference, sequence-to-sequence and aerial-to-ground mapping use the same cell matching: sample pose hypotheses from correspondences and select the pose with the highest score.
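To make the pose-from-correspondences and contrastive steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `fit_se2` and `info_nce` are hypothetical helper names, the inputs are plain lists of 2D cell coordinates and scalar alignment scores, and the paper's actual score function and hypothesis sampling are not reproduced. In 2D, the Kabsch/Procrustes solution reduces to a closed-form angle, which is what `fit_se2` computes.

```python
import math

def fit_se2(src, dst):
    """Least-squares rigid fit (2D Kabsch): rotation theta and translation
    (tx, ty) such that R(theta) @ src[i] + t is as close as possible to dst[i]."""
    n = len(src)
    # Centroids of both point sets.
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n
    dot = 0.0    # sum of dot products of centered pairs
    cross = 0.0  # sum of 2D cross products of centered pairs
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # The optimal angle maximizes cos(theta) * dot + sin(theta) * cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, tx, ty

def info_nce(scores, gt_index):
    """InfoNCE loss over pose hypotheses: negative log-softmax of the
    ground-truth pose's alignment score (log-sum-exp computed stably)."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[gt_index]
```

In this sketch, each sampled set of cell correspondences yields one pose hypothesis through `fit_se2`; the hypotheses' alignment scores, together with the ground-truth pose's score, would then form the `scores` list passed to `info_nce`.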
Results: better single-image positioning on medium and hard maps than SfM + SIFT, SfM + SuperGlue, and OrienterNet. The appendix has more qualitative results, visualizations of neural map features lifted to the scene, semantic mapping/segmentation (a small CNN over neural fields, supervised with GT segmentation labels), and datasets (with distribution analysis). From ETH Zurich (PE Sarlin) and Google.
Links: PapersWithCode, YouTube, GitHub