3D scene understanding enables embodied agents to perceive and interact with their environment. Recent advances in visual-language models have introduced open-vocabulary capabilities that generalize beyond predefined label sets. However, evaluating these representations remains challenging: most current evaluation methods rely on closed-set metrics or task-specific demonstrations. To address this, we introduce OpenLex3D, a benchmark designed to comprehensively assess open-vocabulary scene representations. OpenLex3D augments 23 scenes from widely used indoor RGB-D datasets (Replica, ScanNet++, HM3D) with human-annotated labels that capture real-world linguistic variability and span multiple levels of precision. The benchmark features two evaluation tasks: 3D object segmentation and object retrieval. We use OpenLex3D to evaluate both object-centric and dense open-vocabulary methods, providing deeper insights into their strengths and limitations.
We create new label sets for three widely used RGB-D datasets: ScanNet++, Replica, and HM3D. Each label set comprises categories of varying precision: synonyms, the most precise; depictions, which include, e.g., printed images on objects; visually similar, which covers objects with comparable appearance; and clutter, which accounts for label perturbation due to imprecise segmentation or crop scaling.
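To make the category structure concrete, here is a minimal sketch of how one object's label set could be organized. The object, the specific labels, and the dictionary layout are illustrative assumptions, not the benchmark's actual annotation format:

```python
# Illustrative only: a hypothetical label set for a single couch object.
# The four category names follow the paper; the labels are assumptions.
couch_label_set = {
    "synonyms": {"couch", "sofa", "settee"},        # most precise matches
    "depictions": {"flower"},                       # e.g. a floral print on the fabric
    "visually_similar": {"armchair", "bench"},      # comparable appearance
    "clutter": {"pillow", "blanket"},               # from imprecise masks or crop scaling
}
```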
Using our label sets, we evaluate two tasks: semantic segmentation and object retrieval given a text query. We introduce two novel open-set metrics (above) for segmentation and an extended query set for object retrieval.
The two semantic segmentation metrics operate at either the object level or the feature level. (a) Top-N IoU measures whether any of the top-N responses contains a label from category C. (b) Set Ranking evaluates the ordering of the responses, assessing how closely the predicted rankings align with the ideal ranking of categories.
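As a concrete illustration of the Top-N IoU metric, the sketch below assigns each predicted point the most precise label category matched by any of its top-N responses, then computes a per-category IoU against a ground-truth object mask. The function names, array shapes, and the single-object simplification are assumptions for illustration; the toolkit's actual implementation may differ:

```python
import numpy as np

# Category priority, most precise first (following the paper's label sets).
CATEGORIES = ["synonyms", "depictions", "visually_similar", "clutter"]

def assign_top_n_categories(similarities, vocab, gt_label_set, n=5):
    """Assign each point the most precise category matched in its top-N labels.

    similarities: (num_points, vocab_size) cosine similarities between point
        features and text embeddings of the evaluation vocabulary.
    vocab: list of vocab_size label strings.
    gt_label_set: dict mapping category name -> set of ground-truth labels
        (simplified here to a single object).
    """
    top_n = np.argsort(-similarities, axis=1)[:, :n]  # indices of top-N labels
    assigned = np.full(similarities.shape[0], "none", dtype=object)
    for i, indices in enumerate(top_n):
        predicted = {vocab[j] for j in indices}
        for cat in CATEGORIES:  # credit the most precise matching category
            if predicted & gt_label_set[cat]:
                assigned[i] = cat
                break
    return assigned

def category_iou(assigned, gt_mask, cat):
    """IoU between points assigned to `cat` and the ground-truth object mask."""
    pred_mask = assigned == cat
    union = np.logical_or(pred_mask, gt_mask).sum()
    return np.logical_and(pred_mask, gt_mask).sum() / union if union else 0.0
```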
We show Top-5 IoU results colored by category class. Object-centric methods that segment in 3D (OpenMask3D, Kassab2024) often miss points due to generalization or depth quality issues. Methods that merge 2D segments (ConceptGraphs, HOV-SG) tend to over-merge small segments, leading to misclassifications. Dense representations (ConceptFusion, OpenScene) produce noisier predictions because their point-level features aggregate information across varying context scales.
For the set ranking and object retrieval results tables, please refer to the paper. We also provide evaluation visualization tools in our toolkit.
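For context on how the object retrieval task can be scored, here is a minimal sketch that ranks precomputed, L2-normalized per-object features against a CLIP text embedding of the query. The feature file, its format, and the open_clip model choice are assumptions for illustration; OpenLex3D's actual evaluation pipeline may differ:

```python
import numpy as np
import torch
import open_clip

# Hypothetical input: per-object features produced by a scene representation,
# already L2-normalized, stored as a (num_objects, d) array.
object_features = np.load("object_features.npy")

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def retrieve(query: str, k: int = 5) -> np.ndarray:
    """Return the indices of the k objects whose features best match the query."""
    with torch.no_grad():
        tokens = tokenizer([query])
        text_emb = model.encode_text(tokens)
        text_emb = torch.nn.functional.normalize(text_emb, dim=-1).numpy()[0]
    scores = object_features @ text_emb  # cosine similarity (both normalized)
    return np.argsort(-scores)[:k]

print(retrieve("a place to sit down"))
```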
The authors would like to thank Ulrich-Michael, Frances, James, Maryam, and Mandolyn for their help in labeling the dataset. The work at the Université de Montréal was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (Paull), an NSERC PGS D Scholarship (Morin) and an FRQNT Doctoral Scholarship (Morin). Moreover, this research was enabled in part by compute resources provided by Mila (mila.quebec). The work at the University of Freiburg was funded by an academic grant from NVIDIA. The work at the University of Oxford was supported by a Royal Society University Research Fellowship (Fallon, Kassab), and EPSRC C2C Grant EP/Z531212/1 (Mattamala).