SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs over more than 160K images. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 unique questions spanning 11 question types and 9 visual degradation types, yielding 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

Illustration of our proposed SpaceDG pipeline: Starting from multi-view indoor captures, we optimize 3D Gaussian Splatting with depth priors, apply physically grounded pre- and post-render degradation synthesis, and generate spatial QA pairs via structured question design. An MLLM filter and human experts refine candidates into a 971K training set and a 1.1K human-verified benchmark (SpaceDG-Bench).

Degradation-wise correlation analysis: We quantify how sensitive spatial reasoning is to each degradation using the absolute point-biserial Pearson correlation |r| between clean and degraded scores across all evaluated models (panels a–d show overall degradation effects, answer format, task group, and atomic question type, respectively). Low light and haze consistently induce the largest performance drops, whereas over-exposure and distortion are comparatively milder. Multiple-choice questions show higher degradation correlation than numerical-answer questions; object-centric tasks are more sensitive than camera-centric tasks; and fine-grained perception tasks (e.g., object existence and counting) correlate more strongly with degradations than global geometric tasks (e.g., camera translation). These results suggest that visual degradations primarily impair fine-grained semantic perception and thus disproportionately affect tasks requiring detailed visual grounding.
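As a rough illustration of the statistic named above, the following minimal sketch computes an absolute point-biserial correlation by treating the condition as a binary flag (0 for clean, 1 for degraded) and correlating it with the score. The function name `point_biserial_abs` and the sample scores are hypothetical, and the exact way the scores are pooled across models in the actual analysis is an assumption here, not something stated on this page.

```python
import math


def point_biserial_abs(clean_scores, degraded_scores):
    """Return |r| between a binary condition flag and the score.

    Assumption: the flag is 0 for clean and 1 for degraded samples, and the
    point-biserial correlation is computed as the Pearson correlation
    between that flag and the pooled scores.
    """
    scores = list(clean_scores) + list(degraded_scores)
    flags = [0.0] * len(clean_scores) + [1.0] * len(degraded_scores)
    n = len(scores)

    mean_s = sum(scores) / n
    mean_f = sum(flags) / n

    # Unnormalized covariance and variances; the 1/n factors cancel in r.
    cov = sum((f - mean_f) * (s - mean_s) for f, s in zip(flags, scores))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_f = sum((f - mean_f) ** 2 for f in flags)

    # Degenerate case: constant scores or only one condition present.
    if var_s == 0 or var_f == 0:
        return 0.0
    return abs(cov / math.sqrt(var_s * var_f))


# Hypothetical accuracies (percent) for three models, clean vs. degraded.
r_abs = point_biserial_abs([60, 62, 58], [40, 42, 38])
print(f"|r| = {r_abs:.3f}")
```

Under this reading, a larger |r| means that knowing whether an input was degraded explains more of the variation in score, which is how the panels rank degradations and task types by sensitivity.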

SpaceDG-Bench is integrated with our EASI evaluation toolkit (VLMEvalKit). Download the benchmark from Hugging Face, and see the GitHub repository for full setup instructions and launcher scripts.

Visual Degradation Example

Abstract

SpaceDG at a Glance

Example

Distribution

Leader Board

Key Findings

1. Degradations consistently impair spatial reasoning

2. Humans also degrade — models should go beyond imitation

3. Degradation-aware fine-tuning helps on clean and degraded inputs

4. Fine-grained perception is more sensitive than geometry

Method Overview

3D Annotation & Depth Visualization

Degradation-wise Correlation Analysis

Quick Start (Evaluation)