Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

CVPR 2026 (Highlight)

Ziren Gong¹, Xiaohan Li², Fabio Tosi¹, Jiawei Han³, Stefano Mattoccia¹, Jianfei Cai⁴, Matteo Poggi¹

¹University of Bologna, ²University of Hong Kong, ³Beijing Institute of Technology, ⁴Monash University

Paper Video Code (coming soon) Poster (coming soon) Results

Pipeline of Ov3R. Ov3R is an Open-Vocabulary Semantic 3D Reconstruction Framework. It consists of two novel feed-forward modules, CLIP3R and 2D–3D OVS, and excels in both 3D reconstruction and open-vocabulary 3D semantic segmentation.

Abstract

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI.

The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, an open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors that integrate spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment.

Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step toward real-time, semantics-aware Spatial AI.

Framework

Annotations: Given RGB-only videos, we first apply CLIP3R to produce scene points while SAM predicts 2D segments. Each 2D segment is matched to its corresponding 3D points to obtain 3D semantics. Next, the 2D–3D OVS module extracts the fused 2D–3D descriptor and computes its cosine similarity with the text embeddings of a set of semantic classes.
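The last two steps in the caption above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a pixel-aligned point map of shape (H, W, 3) and an integer map of 2D segment ids, and it uses small hand-written vectors in place of the learned fused descriptors and the CLIP text embeddings. The function names `lift_segments_to_points` and `classify_by_cosine` are hypothetical and chosen only for this example.

```python
import numpy as np


def lift_segments_to_points(pointmap, segment_ids):
    """Group per-pixel 3D points by the 2D segment covering each pixel.

    pointmap:    (H, W, 3) array, one 3D point per pixel (assumed layout).
    segment_ids: (H, W) integer array of 2D segment labels.
    Returns a dict mapping segment id -> (N_i, 3) array of 3D points.
    """
    points = pointmap.reshape(-1, 3)
    labels = segment_ids.reshape(-1)
    return {int(s): points[labels == s] for s in np.unique(labels)}


def classify_by_cosine(descriptors, text_embeddings):
    """Assign each descriptor to the class with the highest cosine similarity.

    descriptors:     (S, D) array, one descriptor per segment (non-zero rows).
    text_embeddings: (C, D) array, one embedding per class name (non-zero rows).
    Returns (labels, sims) with shapes (S,) and (S, C).
    """
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = d @ t.T  # cosine similarity of every segment with every class
    return sims.argmax(axis=1), sims


# Toy data: a 2x2 point map whose top row is segment 0 and bottom row segment 1.
pointmap = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
segment_ids = np.array([[0, 0], [1, 1]])
groups = lift_segments_to_points(pointmap, segment_ids)

# Placeholder vectors standing in for text embeddings and fused descriptors.
class_names = ["chair", "table"]
text_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[0.9, 0.1], [0.2, 0.8]])
labels, sims = classify_by_cosine(descriptors, text_embeddings)
predicted = [class_names[i] for i in labels]  # ["chair", "table"]
```

Normalizing both sides first reduces the cosine similarity to a single matrix product, so every segment is scored against every class name at once; changing the class list only requires new text embeddings.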

Visualization

Annotations: We present open-vocabulary segmentation results for several different objects. Ov3R produces accurate segmentations in diverse cases, e.g., multiple objects (pillows, chairs), small objects (vases), and large objects (tables).

Annotations: For each scene, we provide ground-truth pointmaps as references and compare our reconstructed indoor scenes against SLAM3R and Spann3R. Ov3R achieves better geometric consistency and accuracy, while SLAM3R outputs misaligned reconstructions and Spann3R predicts very sparse pointmaps.

Reconstruction & Segmentation Results

Our final reconstruction and segmentation results are now available. You can find them via the Results button above or download them from Google Drive.

BibTeX

@article{gong2025ov3r,
  title={Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos},
  author={Gong, Ziren and Li, Xiaohan and Tosi, Fabio and Han, Jiawei and Mattoccia, Stefano and Cai, Jianfei and Poggi, Matteo},
  journal={arXiv preprint arXiv:2507.22052},
  year={2025}
}