Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding
Abstract
Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, serving as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. Our proposed system is evaluated using both public, simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation tasks we achieve equivalent performance to the SOTA, while using the same pre-trained model for all of these tasks. Lastly, we demonstrate the system's potential for planning.
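The open-vocabulary room classification described above can be illustrated with a minimal sketch. It assumes each graph node already stores an image embedding and each candidate room name has a text embedding in the same space. The 3-D vectors, the room names, and the `classify_node` helper are hypothetical stand-ins chosen for illustration; real CLIP embeddings are much higher-dimensional, and this is not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify_node(image_embedding, text_embeddings):
    """Return the best-matching room label and the score of every label.

    text_embeddings maps a room name to its (hypothetical) text embedding.
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for CLIP text embeddings of the candidate room classes.
text_embeddings = {
    "kitchen": [1.0, 0.1, 0.0],
    "office": [0.0, 1.0, 0.1],
    "corridor": [0.1, 0.0, 1.0],
}

# Toy stand-in for the image embedding stored in one graph node.
node_embedding = [0.9, 0.2, 0.1]

label, scores = classify_node(node_embedding, text_embeddings)
print(label, scores)
```

Because the class list is just a dictionary of text embeddings, changing the set of room types only means encoding new names; no retraining step appears in this sketch.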
Community
Proposes LEXIS (Language-EXtended Indoor SLAM): uses LLMs for real-time scene understanding and place recognition; builds a topological SLAM graph (using visual-inertial odometry, VIO) with CLIP features in the nodes; forms a 3D scene graph (3DSG) through knowledge from LLMs. Inputs are VIO from VILENS, a video stream from RGB cameras, and a list of potential room classes; the output is a CLIP-enhanced topological map. The front end builds an incremental pose graph (equally spaced keyframes) and stores CLIP image encodings and AKAZE local features per node. Initial room segmentation is done through cosine similarity (a bag of text classes is passed through CLIP); label propagation adds smoothness and consistency with neighbors, and a stairs segment is added using height change. Rooms are clustered based on connected components, proximity, and neighborhood labels (disjoint instances of the same class are formed); stairs are used to separate clusters on different floors (another level in the 3DSG). Loop closure searches through the CLIP embeddings of nodes in the same (assigned) cluster and uses the closest room clusters; geometric verification uses PnP (AKAZE features); the pose graph is optimized with least-squares minimization (for odometry) and a dynamic covariance scaling (DCS) loss for loop closures. Used ViT-L models for CLIP (high classification accuracy for the tested scenes); tested on uHumans2 (from the Kimera team), ORI, and Home (proposed) datasets. LEXIS refined with HYDRA is better than HYDRA and OneFormer (or HRNet) for room classification. Place recognition is compared with DBoW (+ ORB, as in ORB-SLAM and HYDRA) and NetVLAD baselines; LEXIS generally has more true positives. Performance is comparable to ORB-SLAM3 and VINS-Fusion (absolute trajectory error, ATE, on Home and ORI). Connected nodes can form an adjacency matrix, and planning algorithms like Dijkstra can run on the 3DSG/pose-graph data structure. From University of Oxford (ORI).
Links: arxiv, PapersWithCode
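The planning remark in the summary (running Dijkstra over the connected nodes) can be sketched as follows. This is only an illustration under stated assumptions: the graph is a hypothetical adjacency dictionary whose integer keys stand for pose-graph nodes and whose edge weights stand for travel distances in metres; none of the values come from the paper, and `shortest_path` is a name introduced here.

```python
import heapq


def shortest_path(adjacency, start, goal):
    """Dijkstra's algorithm over a weighted adjacency dictionary.

    adjacency maps node -> {neighbor: non-negative edge weight}.
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    visited = set()
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry for an already settled node
        visited.add(node)
        if node == goal:
            break
        for neighbor, weight in adjacency.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if goal not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[goal]


# Hypothetical pose graph: odometry edges between consecutive keyframes
# plus one longer edge between nodes 2 and 4 (for example a loop closure).
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0, 4: 2.5},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0, 2: 2.5},
}

path, cost = shortest_path(graph, start=0, goal=4)
print(path, cost)
```

An adjacency dictionary is used here instead of a dense matrix because pose graphs are sparse; the same search would work on a matrix by iterating over its non-zero entries.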