We study whether multimodal models can detect when two views of the same static scene are geometrically inconsistent. While humans easily spot the augmented object, state-of-the-art multimodal LLMs often cannot.
The two images depict the same static scene, except that one object in the second frame was copied from a different camera pose. The local appearance remains realistic, but the cross-view geometry is no longer consistent.
Select the letter of the inconsistent object.
After you answer, we'll show how a few models responded.
We construct controlled spatial inconsistencies using a simple, scalable cut-and-paste strategy: remove an object from one real view, then paste the same object, captured from a different real camera pose, back into the scene.
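The cut-and-paste step can be sketched in a few lines of NumPy. Note this is a simplified illustration, not the benchmark's actual pipeline: the function name, the rectangular bounding box, and the constant-fill "removal" are all stand-ins (a real pipeline would likely use segmentation masks and inpainting before compositing the cross-pose crop).

```python
import numpy as np

def paste_cross_view_object(frame_b, box_b, crop_from_a, fill_value=128):
    """Build a geometrically inconsistent second frame (illustrative sketch).

    frame_b     : (H, W, 3) uint8 second view of the scene
    box_b       : (y0, x0, y1, x1) region the target object occupies in frame_b
    crop_from_a : (h, w, 3) uint8 crop of the same object taken from a
                  different camera pose (view A)
    fill_value  : constant stand-in for inpainting the removed object
    """
    out = frame_b.copy()
    y0, x0, y1, x1 = box_b
    # Step 1: "remove" the object from view B (a real pipeline would inpaint).
    out[y0:y1, x0:x1] = fill_value
    # Step 2: paste the cross-pose crop at the same location, clipped to the
    # frame bounds so a differently sized crop still fits.
    h = min(crop_from_a.shape[0], out.shape[0] - y0)
    w = min(crop_from_a.shape[1], out.shape[1] - x0)
    out[y0:y0 + h, x0:x0 + w] = crop_from_a[:h, :w]
    return out
```

Because only one region changes, local appearance stays photo-realistic while the pasted object's pose no longer agrees with the rest of the scene's geometry.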
We showcase some example spatial inconsistencies with their scene attributes.
Humans consistently outperform models on every aspect of the benchmark. Furthermore, while human accuracy is robust across various attributes, model accuracy varies markedly.
| Model | Overall (%) | Model | Overall (%) |
|---|---|---|---|
| Gemma 3 12B | 8.5 | Random Chance | 7.9 |
| Idefics3 8B | 8.9 | Human | 84.8 |
| Idefics2 8B | 11.9 | GPT-5 Nano | 15.3 |
| Qwen2.5-VL 7B | 15.9 | Gemini 2.5 Flash | 17.6 |
| InternVL 3.5 8B | 16.1 | GPT-4o | 19.0 |
| LLaVA OneVision 1.5 8B | 16.6 | Gemini 2.5 Pro (HR) | 28.9 |
| SpaceQwen2.5-VL 3B | 18.2 | Gemini 2.5 Pro (LR) | 29.4 |
| Llama 3.2 Multimodal 11B | 23.4 | Gemini 2.5 Pro (MR) | 29.4 |
| Qwen3-VL 4B | 24.7 | GPT-5 (HR) | 30.2 |
| Qwen3-VL 8B Thinking | 25.2 | GPT-5 (MR) | 31.4 |
| Qwen3-VL 8B Instruct | 27.6 | GPT-5 (LR) | 34.2 |
| Ensemble (open source) | 30.1 | Ensemble (all models) | 35.0 |
Overall accuracy (%) on the spatial inconsistency identification task. Humans substantially outperform all tested multimodal models.
TODO