SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou2,3 Yifei Liu2,6 Ziyang Gong1 Jiarui Li4 Qiyue Zhao4 Muyao Niu5
Yuanyuan Gao2,7 Le Ma2 Xue Yang1 Hongjie Zhang2 Zhihang Zhong1,†
1Shanghai Jiao Tong University 2Shanghai Artificial Intelligence Laboratory
3University of Electronic Science and Technology of China 4Chongqing University
5The University of Tokyo 6Beihang University 7Northwestern Polytechnical University

† Corresponding author

TL;DR: We introduce SpaceDG, the first large-scale dataset for degradation-aware spatial intelligence, and SpaceDG-Bench, a human-verified benchmark for evaluating MLLMs under visual degradations.

Visual Degradation Example

[Image comparison: clean vs. degraded view (motion blur)]

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: How robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs and over 160K images. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 unique questions spanning 11 question types and 9 visual degradation types, yielding 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

SpaceDG teaser

A physically grounded synthesis pipeline embeds 9 degradation types into 3D Gaussian Splatting rendering, yielding 971K QA pairs across nearly 1,000 indoor scenes.

SpaceDG at a Glance

971K QA pairs (SpaceDG)
1.1K benchmark questions
11 question types
9 degradation types
25 models evaluated

Example

SpaceDG-Bench example QA instances

Distribution

SpaceDG-Bench taxonomy and degradation distribution

Taxonomy of question types and degradation types in SpaceDG-Bench.

Leaderboard

SpaceDG-Bench evaluation results (leaderboard)

Key Findings

1. Degradations consistently impair spatial reasoning

Visual degradations consistently impair spatial reasoning across all evaluated MLLMs, highlighting the need for degradation-aware spatial evaluation.

2. Humans also degrade — models should go beyond imitation

Humans also suffer clear performance drops under degraded conditions. MLLMs should learn degradation-aware spatial knowledge rather than simply imitate human perception.

3. Degradation-aware fine-tuning helps on clean and degraded inputs

Degradation-based supervised fine-tuning yields substantial improvements on both clean and degraded inputs, indicating that physically grounded degradations can enhance robust spatial understanding.

4. Fine-grained perception is more sensitive than geometry

Visual degradations affect fine-grained object-level perception (e.g., object counting) more strongly than certain geometric reasoning tasks (e.g., camera-centric translation), revealing that detailed visual grounding is particularly sensitive to degraded evidence.
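
The robustness gap discussed in these findings can be made concrete as the accuracy drop from clean to degraded inputs. The sketch below is a minimal illustration under stated assumptions: the record format `(condition, is_correct)`, the condition names, and the helper names `accuracy_by_condition` and `robustness_gap` are hypothetical, and the toy records are made up for illustration, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} from (condition, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


def robustness_gap(records, clean_key="clean"):
    """Accuracy drop (clean minus degraded) for each degraded condition."""
    acc = accuracy_by_condition(records)
    clean_acc = acc[clean_key]
    return {c: clean_acc - a for c, a in acc.items() if c != clean_key}


# Toy, made-up records purely for illustration (hypothetical format).
records = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("low_light", True), ("low_light", False),
    ("low_light", False), ("low_light", False),
    ("motion_blur", True), ("motion_blur", True),
    ("motion_blur", False), ("motion_blur", False),
]

gaps = robustness_gap(records)
for condition, gap in sorted(gaps.items()):
    print(f"{condition}: accuracy drop = {gap:.2f}")
```

A larger gap means the condition hurts the model more. The same computation can be restricted to one question type (for example, object counting) to compare task-level sensitivity.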

Method Overview

SpaceDG pipeline overview

Illustration of our proposed SpaceDG pipeline: Starting from multi-view indoor captures, we optimize 3D Gaussian Splatting with depth priors, apply physically grounded pre- and post-render degradation synthesis, and generate spatial QA pairs via structured question design. An MLLM filter and human experts refine candidates into a 971K training set and a 1.1K human-verified benchmark (SpaceDG-Bench).
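
To give a feel for what a physically grounded degradation can look like, the sketch below models motion blur as the average of sharp frames observed during the exposure window. This is a generic textbook formation model, not the SpaceDG engine: in a 3DGS setting the frames would presumably be rendered at interpolated camera poses, whereas here a horizontal pixel shift (`shifted_frames`, a hypothetical stand-in) replaces re-rendering. All function names and parameters are assumptions.

```python
import numpy as np


def motion_blur_from_frames(frames):
    """Average sharp frames that sample the exposure window."""
    stack = np.stack([f.astype(np.float64) for f in frames], axis=0)
    return stack.mean(axis=0)


def shifted_frames(image, num_steps, max_shift_px):
    """Hypothetical stand-in for rendering along a camera trajectory.

    Each frame is the input image shifted horizontally (with wrap-around),
    which mimics a small sideways camera motion.
    """
    shifts = np.linspace(0, max_shift_px, num_steps).round().astype(int)
    return [np.roll(image, int(s), axis=1) for s in shifts]


# Toy image: one bright vertical line on a black background.
image = np.zeros((8, 16))
image[:, 4] = 1.0

frames = shifted_frames(image, num_steps=5, max_shift_px=4)
blurred = motion_blur_from_frames(frames)

print(blurred[0].round(2))
```

Averaging spreads the line's energy over the columns it crosses while keeping total intensity unchanged. That is the property that distinguishes this formation model from an arbitrary smoothing filter applied after the fact.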

3D Annotation & Depth Visualization

Our 3D annotation pipeline mainly follows Holi-Spatial; please refer to Holi-Spatial for the detailed data construction procedure.

3D Annotations: Scene 0a76e06478, Scene 1a22a99186, Scene b5918e4637

Refined Depth: Scene 0a76e06478, Scene 1a22a99186, Scene b5918e4637

Degradation-wise Correlation Analysis

Degradation-wise correlation analysis on SpaceDG-Bench

Degradation-wise correlation analysis: We quantify how sensitive spatial reasoning is to each degradation using the absolute point-biserial Pearson correlation |r| between clean and degraded scores across all evaluated models (panels a–d show overall degradation effects, answer format, task group, and atomic question type, respectively). Low light and haze consistently induce the largest performance drops, whereas over-exposure and distortion are comparatively milder. Multiple-choice questions show higher degradation correlation than numerical-answer questions; object-centric tasks are more sensitive than camera-centric tasks; and fine-grained perception tasks (e.g., object existence and counting) correlate more strongly with degradations than global geometric tasks (e.g., camera translation). These results suggest that visual degradations primarily impair fine-grained semantic perception and thus disproportionately affect tasks requiring detailed visual grounding.
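
As a rough illustration of the statistic named above, the point-biserial correlation is the Pearson correlation between a binary variable and a numeric one. The sketch below assumes the binary variable marks whether an input is degraded (0 = clean, 1 = degraded) and the numeric variable is the per-instance score. That pairing is our reading of the description, not a confirmed detail of the authors' analysis, and the function name and toy data are hypothetical.

```python
import numpy as np


def point_biserial_abs(is_degraded, scores):
    """Absolute point-biserial correlation |r|.

    Computed as the Pearson correlation between a 0/1 indicator
    and a numeric score, then taken in absolute value.
    """
    x = np.asarray(is_degraded, dtype=float)
    y = np.asarray(scores, dtype=float)
    if x.std() == 0 or y.std() == 0:
        raise ValueError("need both conditions present and non-constant scores")
    r = np.corrcoef(x, y)[0, 1]
    return abs(r)


# Toy, made-up data: 4 clean and 4 degraded instances with 0/1 correctness.
is_degraded = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [1, 1, 1, 0, 1, 0, 0, 0]

abs_r = point_biserial_abs(is_degraded, scores)
print(f"|r| = {abs_r:.2f}")
```

Under this reading, a larger |r| means the score depends more strongly on whether the input was degraded, so the degradation (or task) is more sensitive.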

Quick Start (Evaluation)

SpaceDG-Bench is integrated with our EASI evaluation toolkit (VLMEvalKit). Download the benchmark from Hugging Face, and see the GitHub repository for full setup instructions and launcher scripts.