OpenLex3D: A New Benchmark for Open-Vocabulary 3D Scene Representations


1University of Oxford      2Université de Montréal      3University of Freiburg
4Mila - Quebec AI Institute      5Canada CIFAR AI Chair
*Indicates Equal Contribution

TL;DR Unlike closed-vocabulary evaluation, the OpenLex3D benchmark provides label categories of varying precision, enabling detailed analysis of open-vocabulary scene representation methods.

Abstract

3D scene understanding enables embodied agents to perceive and interact with their environment. Recent advances in visual-language models have introduced open-vocabulary capabilities which generalize beyond predefined label sets. However, evaluating these representations remains challenging. Most current evaluation methods rely on closed-set metrics or task-specific demonstrations. To address this, we introduce OpenLex3D, a benchmark designed to comprehensively assess open-vocabulary scene representations. OpenLex3D enhances 23 scenes from widely used indoor RGB-D datasets (Replica, ScanNet++, HM3D) with human-annotated labels that capture real-world linguistic variability and span multiple accuracy levels. The benchmark features two tasks for assessment: 3D object segmentation and object retrieval. We use OpenLex3D to evaluate both object-centric and dense open-vocabulary methods and provide deeper insights into their strengths and limitations.

Benchmark Design


We create new label sets for three widely-used RGB-D datasets: ScanNet++, Replica, and HM3D. The label sets consist of categories of varying precision: synonyms, the most precise; depictions, which include, e.g., images printed on objects; visually similar, which refers to objects with comparable appearance; and clutter, which accounts for label perturbation due to imprecise segmentation or crop scaling.
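
To make the category structure concrete, here is a small, hypothetical example of what a per-object label entry could look like. The schema and values below are assumptions for illustration only, not the benchmark's actual file format.

```python
# Hypothetical per-object label entry illustrating the four categories; the
# field names below are an assumption, not OpenLex3D's actual schema.
sofa_labels = {
    "synonyms": ["sofa", "couch", "settee"],      # most precise descriptions
    "depictions": ["flower"],                     # e.g. a flower printed on the fabric
    "visually_similar": ["armchair", "bench"],    # objects with comparable appearance
    "clutter": ["cushion", "blanket"],            # labels caused by imprecise masks or crops
}
```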

Metrics

Using our label sets, we evaluate two tasks: semantic segmentation and object retrieval from a text query. We introduce two novel open-set metrics (shown above) for segmentation and an extended query set for object retrieval.


The two semantic segmentation metrics operate at either the object level or the feature level. (a) Top-N IoU measures whether any of the top-N responses contains a label from category C. (b) Set Ranking evaluates the ranking of the responses, assessing how closely the predicted rankings align with the ideal ranking of categories.
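
The following is a minimal sketch of the per-point matching step behind a Top-N metric, assuming each point carries a ground-truth label list per category and a list of its top-N predicted labels. The benchmark then aggregates such hits into an IoU; that aggregation is omitted here, and all names are illustrative rather than the toolkit's actual API.

```python
# Most-precise-first ordering of the OpenLex3D label categories.
CATEGORIES = ("synonyms", "depictions", "visually_similar", "clutter")

def top_n_category(top_n_labels, gt_label_sets):
    """Return the most precise category hit by any of the top-N predicted labels."""
    for category in CATEGORIES:
        if any(label in gt_label_sets[category] for label in top_n_labels):
            return category
    return "missed"

# Toy example: a point belonging to a sofa, checked against top-3 predictions.
gt = {
    "synonyms": {"sofa", "couch"},
    "depictions": set(),
    "visually_similar": {"armchair", "bench"},
    "clutter": {"cushion"},
}
print(top_n_category(["chair", "armchair", "table"], gt))  # -> "visually_similar"
print(top_n_category(["lamp", "plant", "tv"], gt))         # -> "missed"
```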

Semantic Segmentation Evaluation

We show Top-5 IoU results colored by category class. Object-centric methods that segment in 3D (OpenMask3D, Kassab2024) often miss points due to generalization or depth quality issues. Those merging 2D segments (ConceptGraphs, HOV-SG) tend to merge small segments together, leading to misclassifications. Dense representations (ConceptFusion, OpenScene) produce noisier predictions due to point-level features aggregating information from various context scales.

For the set ranking and object retrieval results tables, check out the paper. We also provide evaluation visualization tools in our toolkit.
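
As a rough illustration of the retrieval task (not the benchmark's scoring code), ranking per-object features against an encoded text query could look like the sketch below. The query feature is assumed to come from whatever vision-language encoder a method uses (e.g. a CLIP-style text encoder); the function name and inputs are placeholders.

```python
import numpy as np

def retrieve_objects(query_feat: np.ndarray, object_feats: np.ndarray, k: int = 5):
    """Return indices of the k objects most similar to the query feature (cosine similarity)."""
    q = query_feat / np.linalg.norm(query_feat)
    objs = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    scores = objs @ q                 # cosine similarity per object
    return np.argsort(-scores)[:k]    # indices of the top-k objects

# Example with random placeholder features (100 objects, 512-dim).
rng = np.random.default_rng(0)
print(retrieve_objects(rng.normal(size=512), rng.normal(size=(100, 512))))
```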






Acknowledgements

The authors would like to thank Ulrich-Michael, Frances, James, Maryam, and Mandolyn for their help in labeling the dataset. The work at the Université de Montréal was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (Paull), an NSERC PGS D Scholarship (Morin) and an FRQNT Doctoral Scholarship (Morin). Moreover, this research was enabled in part by compute resources provided by Mila (mila.quebec). The work at the University of Freiburg was funded by an academic grant from NVIDIA. The work at the University of Oxford was supported by a Royal Society University Research Fellowship (Fallon, Kassab), and EPSRC C2C Grant EP/Z531212/1 (Mattamala).