Multimodal Large Language Models Cannot Spot Spatial Inconsistencies

We study whether multimodal models can detect when two views of the same static scene are geometrically inconsistent. While humans easily spot the augmented object, state-of-the-art multimodal LLMs often cannot.

Teaser pair showing two views of a scene, with one object spatially inconsistent.

Which labeled object is spatially inconsistent?

The two images depict the same static scene, except that one object in the second frame was copied from a different camera pose. The local appearance remains realistic, but the cross-view geometry is no longer consistent.


Spatial inconsistencies from multi-view scenes

We construct controlled spatial inconsistencies with a simple, scalable cut-and-paste strategy: remove an object from one real view, then paste the same object, cropped from a different real camera pose, back into the scene.

Figure showing the cut-and-paste pipeline for synthesizing spatial inconsistencies.
Given three views V1, V2, and V3, we synthesize a spatial inconsistency by: (1) selecting an object visible in all three views, (2) removing that object from V2 and inpainting the background, and (3) pasting the same object, as it appears in V3, back into its original location in V2. We label candidate objects in V1 for forced-choice evaluation.
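The three synthesis steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual pipeline: a naive mean-color fill stands in for the learned inpainter, and the object masks (including the V3 mask already translated to the object's original location in V2) are assumed to be precomputed by a segmenter.

```python
import numpy as np

def make_inconsistency(v2, v3, mask_v2, mask_v3_aligned):
    """Replace the object in V2 with the same object as seen from V3's pose.

    v2, v3:          H x W x 3 float images (two real views of the scene).
    mask_v2:         boolean mask of the object as it appears in V2.
    mask_v3_aligned: boolean mask of the V3 crop, translated to the object's
                     original location in V2 (assumed precomputed).
    """
    edited = v2.copy()
    # (2) Remove the object from V2. A real pipeline would use a learned
    # inpainter; here we naively fill with the mean background color.
    background = v2[~mask_v2].reshape(-1, 3).mean(axis=0)
    edited[mask_v2] = background
    # (3) Paste the object's pixels from V3 into the original location in V2.
    edited[mask_v3_aligned] = v3[mask_v3_aligned]
    return edited
```

Because both the removed crop and the pasted crop come from real photographs, local appearance stays realistic; only the cross-view geometry breaks.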

Example spatial inconsistencies

Below are example spatial inconsistencies, together with their scene attributes and the answers given by three representative models.

Labeled lounge scene.
Frame A
Edited lounge scene.
Frame B
Depth: Med · Lighting: High · Physically Implausible · # Labels: 5 · Scene: Lounge · Object: Chair · Answer: D
GPT-5 (LR): C · Gemini 2.5 Pro (MR): D · Qwen3-VL 8B Instruct: B
Labeled bathroom scene.
Frame A
Edited bathroom scene.
Frame B
Depth: Med · Lighting: Med · Physically Implausible · # Labels: 10 · Scene: Bathroom · Object: Mirror · Answer: B
GPT-5 (LR): A · Gemini 2.5 Pro (MR): J · Qwen3-VL 8B Instruct: J
Labeled room with chairs.
Frame A
Edited room with chairs.
Frame B
Depth: Low · Lighting: Med · Physically Implausible · # Labels: 9 · Scene: Misc · Object: Chair · Answer: F
GPT-5 (LR): F · Gemini 2.5 Pro (MR): F · Qwen3-VL 8B Instruct: F
Labeled kitchen scene with window.
Frame A
Edited kitchen scene with window.
Frame B
Depth: High · Lighting: Med · Physically Plausible · # Labels: 18 · Scene: Kitchen · Object: Window · Answer: J
GPT-5 (LR): E · Gemini 2.5 Pro (MR): Q · Qwen3-VL 8B Instruct: J

Benchmark results

Humans consistently outperform models on every slice of the benchmark. Moreover, while human accuracy is robust across scene attributes, model accuracy varies markedly.

Model                       Overall     Model                     Overall
Gemma 3 12B                    8.5      Random Chance                 7.9
Idefics3 8B                    8.9      Human                        84.8
Idefics2 8B                   11.9      GPT-5 Nano                   15.3
Qwen2.5-VL 7B                 15.9      Gemini 2.5 Flash             17.6
InternVL 3.5 8B               16.1      GPT-4o                       19.0
LLaVA OneVision 1.5 8B        16.6      Gemini 2.5 Pro (HR)          28.9
SpaceQwen2.5-VL 3B            18.2      Gemini 2.5 Pro (LR)          29.4
Llama 3.2 Multimodal 11B      23.4      Gemini 2.5 Pro (MR)          29.4
Qwen3-VL 4B                   24.7      GPT-5 (HR)                   30.2
Qwen3-VL 8B Thinking          25.2      GPT-5 (MR)                   31.4
Qwen3-VL 8B Instruct          27.6      GPT-5 (LR)                   34.2
Ensemble (open source)        30.1      Ensemble (all models)        35.0

Overall accuracy (%) on the spatial inconsistency identification task. Humans substantially outperform all tested multimodal models.
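The overall number above, and the per-attribute breakdowns that follow, reduce to simple grouped forced-choice accuracies. A minimal sketch, assuming each question is stored as a dict with hypothetical `pred`, `answer`, and per-attribute keys (these field names are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict

def accuracy(records):
    """Overall forced-choice accuracy (%) over (pred, answer) records."""
    correct = sum(r["pred"] == r["answer"] for r in records)
    return 100.0 * correct / len(records)

def accuracy_by(records, attribute):
    """Accuracy (%) grouped by a scene attribute, e.g. 'depth' or 'scene'."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r)
    return {value: accuracy(group) for value, group in groups.items()}
```

The same `accuracy_by` call, swapping the attribute key, produces each of the breakdown plots below.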

Accuracy by depth, scene lighting, and physical plausibility.
Model accuracy varies greatly across scene attributes: depth, lighting, and physical plausibility all matter.
Accuracy by inconsistent object class.
Accuracy varies greatly across the augmented object’s class, revealing brittle 3D understanding across common object types.
Accuracy by scene type.
Model accuracy also changes substantially across scene types.
Accuracy by number of labels per pair.
The number of candidate labels per pair has only a modest effect on accuracy, suggesting that the difficulty lies in the perceptual task itself rather than in the size of the answer space.
Pairwise overlap of incorrect questions and wrong answers across models.
Models often fail on similar questions, but they rarely agree on the same wrong answer.
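One way to quantify this overlap (a sketch of the analysis, not necessarily the exact metric used) is pairwise Jaccard similarity over each model's set of failed question ids, given a hypothetical `wrong_by_model` mapping:

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of question ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def failure_overlap(wrong_by_model):
    """Pairwise Jaccard overlap of the question sets each model gets wrong.

    wrong_by_model: dict mapping model name -> set of failed question ids.
    Returns a dict keyed by (model1, model2) pairs.
    """
    models = sorted(wrong_by_model)
    return {
        (m1, m2): jaccard(wrong_by_model[m1], wrong_by_model[m2])
        for i, m1 in enumerate(models)
        for m2 in models[i + 1:]
    }
```

Running the same computation over (question id, wrong answer) pairs instead of question ids alone would separate "failing on the same question" from "agreeing on the same wrong answer".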

BibTeX

TODO