Multimodal Large Language Models Cannot Spot Spatial Inconsistencies

We study whether multimodal models can detect when two views of the same static scene are geometrically inconsistent. While humans easily spot the augmented object, state-of-the-art multimodal LLMs often cannot.

Teaser pair showing two views of a scene, with one object spatially inconsistent.

Which labeled object is spatially inconsistent?

The two images depict the same static scene, except that one object in the second frame was copied from a different camera pose. The local appearance remains realistic, but the cross-view geometry is no longer consistent.


Spatial inconsistencies from multi-view scenes

We construct controlled spatial inconsistencies with a simple, scalable cut-and-paste strategy: remove an object from one real view, then paste the same object, cropped from a different real camera pose, back into the scene.

Figure showing the cut-and-paste pipeline for synthesizing spatial inconsistencies.
Given three views V1, V2, and V3, we synthesize a spatial inconsistency by: (1) selecting an object visible in all three views, (2) removing that object from V2 and inpainting the background, and (3) pasting the same object, as it appears in V3, back into its original location in V2. We label candidate objects in V1 for forced-choice evaluation.
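The three synthesis steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual pipeline: a naive mean-color fill stands in for the learned inpainter, and the object masks (including the V3 mask already translated to the object's original location in V2) are assumed to be precomputed by a segmenter.

```python
import numpy as np

def make_inconsistency(v2, v3, mask_v2, mask_v3_aligned):
    """Replace the object in V2 with the same object as seen from V3's pose.

    v2, v3:          H x W x 3 float images (two real views of the scene).
    mask_v2:         boolean mask of the object as it appears in V2.
    mask_v3_aligned: boolean mask of the V3 crop, translated to the object's
                     original location in V2 (assumed precomputed).
    """
    edited = v2.copy()
    # (2) Remove the object from V2. A real pipeline would use a learned
    # inpainter; here we naively fill with the mean background color.
    background = v2[~mask_v2].reshape(-1, 3).mean(axis=0)
    edited[mask_v2] = background
    # (3) Paste the object's pixels from V3 into the original location in V2.
    edited[mask_v3_aligned] = v3[mask_v3_aligned]
    return edited
```

Because both the removed crop and the pasted crop come from real photographs, local appearance stays realistic; only the cross-view geometry breaks.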

Example spatial inconsistencies

Below are example spatial inconsistencies, together with their scene attributes and the answers given by three representative models.

Labeled lounge scene.
Frame A
Edited lounge scene.
Frame B
Depth: Med · Lighting: High · Physically Implausible · # Labels: 5 · Scene: Lounge · Object: Chair · Answer: D
GPT-5 (LR): C · Gemini 2.5 Pro (MR): D · Qwen3-VL 8B Instruct: B
Labeled bathroom scene.
Frame A
Edited bathroom scene.
Frame B
Depth: Med · Lighting: Med · Physically Implausible · # Labels: 10 · Scene: Bathroom · Object: Mirror · Answer: B
GPT-5 (LR): A · Gemini 2.5 Pro (MR): J · Qwen3-VL 8B Instruct: J
Labeled room with chairs.
Frame A
Edited room with chairs.
Frame B
Depth: Low · Lighting: Med · Physically Implausible · # Labels: 9 · Scene: Misc · Object: Chair · Answer: F
GPT-5 (LR): F · Gemini 2.5 Pro (MR): F · Qwen3-VL 8B Instruct: F
Labeled kitchen scene with window.
Frame A
Edited kitchen scene with window.
Frame B
Depth: High · Lighting: Med · Physically Plausible · # Labels: 18 · Scene: Kitchen · Object: Window · Answer: J
GPT-5 (LR): E · Gemini 2.5 Pro (MR): Q · Qwen3-VL 8B Instruct: J

Benchmark results

Humans consistently outperform models on every slice of the benchmark. Moreover, while human accuracy is robust across scene attributes, model accuracy varies markedly.

Model                       Overall     Model                     Overall
Gemma 3 12B                    8.5      Random Chance                 7.9
Idefics3 8B                    8.9      Human                        84.8
Idefics2 8B                   11.9      GPT-5 Nano                   15.3
Qwen2.5-VL 7B                 15.9      Gemini 2.5 Flash             17.6
InternVL 3.5 8B               16.1      GPT-4o                       19.0
LLaVA OneVision 1.5 8B        16.6      Gemini 2.5 Pro (HR)          28.9
SpaceQwen2.5-VL 3B            18.2      Gemini 2.5 Pro (LR)          29.4
Llama 3.2 Multimodal 11B      23.4      Gemini 2.5 Pro (MR)          29.4
Qwen3-VL 4B                   24.7      GPT-5 (HR)                   30.2
Qwen3-VL 8B Thinking          25.2      GPT-5 (MR)                   31.4
Qwen3-VL 8B Instruct          27.6      GPT-5 (LR)                   34.2
Ensemble (open source)        30.1      Ensemble (all models)        35.0

Overall accuracy (%) on the spatial inconsistency identification task. Humans substantially outperform all tested multimodal models.
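The overall number above, and the per-attribute breakdowns that follow, reduce to simple grouped forced-choice accuracies. A minimal sketch, assuming each question is stored as a dict with hypothetical `pred`, `answer`, and per-attribute keys (these field names are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict

def accuracy(records):
    """Overall forced-choice accuracy (%) over (pred, answer) records."""
    correct = sum(r["pred"] == r["answer"] for r in records)
    return 100.0 * correct / len(records)

def accuracy_by(records, attribute):
    """Accuracy (%) grouped by a scene attribute, e.g. 'depth' or 'scene'."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r)
    return {value: accuracy(group) for value, group in groups.items()}
```

The same `accuracy_by` call, swapping the attribute key, produces each of the breakdown plots below.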

Accuracy by depth, scene lighting, and physical plausibility.
Model accuracy varies greatly across scene attributes: depth, lighting, and physical plausibility all matter.
Accuracy by inconsistent object class.
Accuracy varies greatly across the augmented object’s class, revealing brittle 3D understanding across common object types.
Accuracy by scene type.
Model accuracy also changes substantially across scene types.
Accuracy by number of labels per pair.
The number of candidate labels per pair has only a modest effect on accuracy, suggesting that the difficulty lies in the perceptual task itself rather than in the size of the answer space.
Pairwise overlap of incorrect questions and wrong answers across models.
Models often fail on similar questions, but they rarely agree on the same wrong answer.
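One way to quantify this overlap (a sketch of the analysis, not necessarily the exact metric used) is pairwise Jaccard similarity over each model's set of failed question ids, given a hypothetical `wrong_by_model` mapping:

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of question ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def failure_overlap(wrong_by_model):
    """Pairwise Jaccard overlap of the question sets each model gets wrong.

    wrong_by_model: dict mapping model name -> set of failed question ids.
    Returns a dict keyed by (model1, model2) pairs.
    """
    models = sorted(wrong_by_model)
    return {
        (m1, m2): jaccard(wrong_by_model[m1], wrong_by_model[m2])
        for i, m1 in enumerate(models)
        for m2 in models[i + 1:]
    }
```

Running the same computation over (question id, wrong answer) pairs instead of question ids alone would separate "failing on the same question" from "agreeing on the same wrong answer".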

BibTeX

TODO