Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Sergey Linok1†, Tatiana Zemskova1,2, Svetlana Ladanova1, Roman Titkov1, Dmitry Yudin1,2, Maxim Monastyrny3, Aleksei Valenkov3
1Moscow Institute of Physics and Technology (MIPT), 2AIRI, 3Sberbank of Russia, Robotics Center
† denotes corresponding author

Abstract

Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe map objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ achieves leading performance in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement, enabling grounding of objects with complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation yields high data processing speed in experiments on a robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects.

Method


The proposed BBQ approach leverages foundation models for high-performance construction of an object-centric, class-agnostic 3D map of a static indoor environment from a sequence of RGB-D frames with known camera poses and calibration. To perform scene understanding, we represent the environment as a set of nodes with spatial relations. Using the designed deductive scene reasoning algorithm, our method enables efficient natural language interaction with a scene-aware large language model.
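
For illustration, here is a minimal sketch, under our own simplifying assumptions rather than the exact prompts or interfaces from the paper, of how scene graph nodes and edges could be serialized for a two-stage deductive grounding query to an LLM: candidate selection first, then relational reasoning over the selected subgraph. The Node, Edge and llm names are hypothetical placeholders.

# Minimal sketch (hypothetical names, not the paper's exact prompts or API):
# serialize graph nodes and edges to JSON, ask an LLM to shortlist candidate
# objects, then ask it to resolve the query over the candidate subgraph.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    id: int
    caption: str                          # e.g., a VLM-generated description
    center: tuple[float, float, float]    # object centroid in world frame

@dataclass
class Edge:
    src: int
    dst: int
    relation: str                         # e.g., "near", "on top of"

def ground_object(nodes: list[Node], edges: list[Edge], query: str,
                  llm: Callable[[str], str]) -> int:
    """Two-stage deductive grounding: shortlist candidates, then reason over relations."""
    scene = [{"id": n.id, "caption": n.caption, "center": n.center} for n in nodes]
    prompt_1 = ("Scene objects:\n" + json.dumps(scene) +
                f"\nQuery: {query}\nReturn a JSON list of ids of objects that may match "
                "the query or serve as its spatial anchors.")
    candidates = set(json.loads(llm(prompt_1)))

    relations = [{"src": e.src, "dst": e.dst, "relation": e.relation}
                 for e in edges if e.src in candidates and e.dst in candidates]
    prompt_2 = ("Candidate objects:\n" +
                json.dumps([o for o in scene if o["id"] in candidates]) +
                "\nRelations:\n" + json.dumps(relations) +
                f"\nQuery: {query}\nReturn only the id of the single referred object.")
    return int(llm(prompt_2))

Passing only the candidate subgraph to the second stage keeps the prompt compact even for scenes with many objects.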


An object-centric, class-agnostic 3D map is iteratively constructed from a sequence of RGB-D camera frames and their poses by associating 2D MobileSAMv2 mask proposals with 3D objects using deep DINOv2 visual features and spatial constraints (Sec. III-A). To visually represent each object after building the map, we select the best view, based on the largest projected mask, from L cluster centroids that represent areas of object observations (Sec. III-B). We leverage LLaVA [15] to describe object visual properties (Sec. III-C). With the nodes' text descriptions, spatial locations, and metric and semantic edges (Sec. III-D), we utilize an LLM in our deductive reasoning algorithm (Sec. III-E) to perform the 3D object grounding task.
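
As an illustration of the association step, the sketch below assumes each 2D mask has already been lifted to a 3D point cloud (via the depth map and camera pose) and pooled into a single visual descriptor (e.g., averaged DINOv2 patch features); Object3D, spatial_iou and the thresholds are our own illustrative choices, not the authors' implementation.

# Illustrative sketch of frame-to-map association: a new detection is merged
# into an existing map object when both visual similarity and spatial overlap
# are high enough; otherwise it starts a new object. Thresholds are arbitrary.
import numpy as np
from dataclasses import dataclass

@dataclass
class Object3D:
    points: np.ndarray       # (N, 3) accumulated 3D points
    feature: np.ndarray      # (D,) running mean of visual descriptors
    n_views: int = 1

def spatial_iou(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """IoU of axis-aligned 3D bounding boxes as a cheap spatial constraint."""
    lo = np.maximum(points_a.min(0), points_b.min(0))
    hi = np.minimum(points_a.max(0), points_b.max(0))
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(points_a.max(0) - points_a.min(0))
    vol_b = np.prod(points_b.max(0) - points_b.min(0))
    return float(inter / (vol_a + vol_b - inter + 1e-9))

def associate(objects: list[Object3D], det_points: np.ndarray,
              det_feature: np.ndarray, sim_thr: float = 0.7,
              iou_thr: float = 0.25) -> None:
    """Merge a new detection into the best-matching object or register a new one."""
    best_idx, best_sim = -1, -1.0
    for i, obj in enumerate(objects):
        sim = float(det_feature @ obj.feature /
                    (np.linalg.norm(det_feature) * np.linalg.norm(obj.feature) + 1e-9))
        if sim < sim_thr or spatial_iou(det_points, obj.points) < iou_thr:
            continue
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx < 0:
        objects.append(Object3D(points=det_points, feature=det_feature))
    else:
        obj = objects[best_idx]
        obj.points = np.vstack([obj.points, det_points])
        obj.feature = (obj.feature * obj.n_views + det_feature) / (obj.n_views + 1)
        obj.n_views += 1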

Results

3D open-vocabulary semantic segmentation


3D object grounding


BibTeX

@misc{linok2024barequeriesopenvocabularyobject,
      title={Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph}, 
      author={Sergey Linok and Tatiana Zemskova and Svetlana Ladanova and Roman Titkov and Dmitry Yudin and Maxim Monastyrny and Aleksei Valenkov},
      year={2024},
      eprint={2406.07113},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.07113}, 
}