STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation

1Carnegie Mellon University, 2Bosch Center for AI
Sample Image

Contributions

  • We propose a framework that incrementally builds a structured representation of the environment, enabling the VLM to make more informed decisions.
  • We design an efficient two-stage navigation policy on top of this representation, combining high-level planning guided by the VLM's reasoning with VLM-assisted low-level exploration.
  • STRIVE achieves state-of-the-art performance on simulated benchmarks (HM3D, RoboTHOR, MP3D) and shows strong performance in diverse and complex real-world environments.

Video

Abstract

We propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoint nodes, object nodes, and room nodes. Viewpoint and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently locate a goal object.
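The three node types above can be pictured as a small layered graph. The sketch below is purely illustrative: the class names, fields, and `object_labels` helper are assumptions for exposition, not the authors' actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three-layer representation: object nodes
# attach to the viewpoints they were observed from, viewpoints attach
# to rooms, and rooms link to adjacent rooms (e.g. via doorways).

@dataclass
class ObjectNode:
    label: str        # e.g. "nightstand"
    position: tuple   # (x, y) in the map frame

@dataclass
class ViewpointNode:
    position: tuple
    objects: list = field(default_factory=list)    # ObjectNodes seen here

@dataclass
class RoomNode:
    room_id: int
    viewpoints: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)  # adjacent RoomNodes

    def object_labels(self):
        """Semantic cues in this room, aggregated over its viewpoints."""
        return sorted({o.label for v in self.viewpoints for o in v.objects})
```

Aggregating object labels per room like this is one way the structured representation could be serialized into a compact, text-friendly input for the VLM.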

We evaluated our approach on three simulated benchmarks (HM3D, RoboTHOR, and MP3D), and achieved state-of-the-art performance on both the success rate (↑ 7.1%) and navigation efficiency (↑ 12.5%). We further validate our method on a real robot platform, demonstrating strong robustness across 15 object navigation tasks in 10 different indoor environments.

Method Overview

Sample Image

Overview of STRIVE. We construct a multi-layer representation R on the fly, consisting of object, viewpoint, and room nodes, which serves as a structured input for the VLM. Based on R, we introduce a two-stage navigation policy: the VLM reasons and plans at the room level, while the agent explores within each room at the viewpoint level using a VLM-assisted frontier-based navigation strategy and VLM-based target verification.
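The two-stage policy can be sketched as a simple loop. This is a hedged outline, not the released implementation: `vlm_select_room`, `frontier_explore`, and `vlm_verify` are hypothetical stand-ins for the room-level VLM planner, the in-room frontier explorer, and the VLM-based verifier.

```python
def navigate(rooms, goal_label, vlm_select_room, frontier_explore, vlm_verify):
    """Two-stage loop: room-level planning, then in-room exploration."""
    visited = set()
    while True:
        # Stage 1: high-level planning -- the VLM picks the next room
        # from the structured representation, skipping visited rooms.
        room = vlm_select_room(rooms, goal_label, visited)
        if room is None:
            return None  # exploration exhausted, target not found
        visited.add(room.room_id)

        # Stage 2: low-level, VLM-assisted frontier exploration in the room.
        for candidate in frontier_explore(room, goal_label):
            # VLM-based verification using contextual cues around the detection.
            if vlm_verify(candidate, goal_label):
                return candidate  # target located
```

The loop terminates either when a candidate detection is verified as the goal object or when the planner runs out of unvisited rooms.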

Benchmark Results

Sample Image

Comparison with SOTA methods with different settings on HM3D, RoboTHOR, and MP3D datasets. We report the Success Rate (SR) and Success weighted by Path Length (SPL) metrics.

Qualitative Results

Sample Image

Qualitative visualization of STRIVE. The first and second steps show the VLM's reasoning process, where it selects Rooms 6 and 9 by jointly considering room layout ('doorway'), semantic cues ('nightstand'), and travel cost (penalized distance). The final step shows VLM-based verification, using contextual cues (e.g., mattress, pillows) to confirm the target object as a 'bed'.
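The trade-off between semantic cues and travel cost in this example can be illustrated with a simple scoring rule. The linear form and the weights below are assumptions for exposition; the paper's VLM performs this weighing through reasoning rather than an explicit formula.

```python
# Illustrative room-scoring rule: semantic relevance to the goal raises
# the score, while longer (penalized) travel distance lowers it.
# alpha and beta are made-up weights, not values from the paper.

def room_score(semantic_relevance, travel_cost, alpha=1.0, beta=0.5):
    """Higher semantic relevance helps; travel cost is penalized."""
    return alpha * semantic_relevance - beta * travel_cost

# A bedroom-like room (nightstand cues) beats a nearer but less
# relevant hallway when searching for a 'bed'.
rooms = {
    6: room_score(semantic_relevance=0.9, travel_cost=0.4),
    3: room_score(semantic_relevance=0.2, travel_cost=0.1),
}
best = max(rooms, key=rooms.get)
```

Under these assumed weights, the semantically rich room wins despite its higher travel cost, mirroring the behavior shown in the visualization.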

Real-world Experiments

Experiments on HM3D

Experiments on RoboTHOR

Experiments on MP3D