By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for visual perception and scene understanding? We finetune two different types of generative models for category-agnostic instance segmentation: Stable Diffusion and ImageNet-pretrained MAE. Surprisingly, when finetuned exclusively on a narrow set of object types (indoor furnishings and cars), both models exhibit strong zero-shot generalization, accurately segmenting object types and image styles unseen in finetuning. Our models excel at segmenting fine structures, ambiguous boundaries, and occluded objects.
The gallery below compares our finetuned Stable Diffusion and MAE-H models with SAM on several images. Use the slider and gestures to reveal details on both sides. Interestingly, MAE-H also generalizes zero-shot to object types and image styles never present in its ImageNet pretraining, as shown below. Some artifacts appear in MAE-H's predictions because finetuning and inference are performed at 224x224 resolution to match its pretraining. In SAM's predictions, black areas are regions where no object was detected.
Our models assign similar colors to compositionally related parts of a scene. Vader’s mask and body (top), or the bowties and shirts (bottom), are separated by subtly different hues, while distinct colors partition unrelated parts such as Vader’s leg and the poles (top), or the dogs and the text (bottom). This emerges without any part-level supervision, suggesting that generative models learn hierarchical scene representations. See if you can spot this above, too!
We introduce a simple method to prompt our model for binary masks. We first average the feature colors near a prompt point $p$ to create a query feature $q_p$. We then compute the similarity between $q_p$ and every pixel in the feature map, and threshold the resulting similarity map to obtain a binary mask. See the paper for more details.
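The sketch below illustrates this point-prompting procedure on a dense per-pixel feature map. It is a minimal NumPy example, not the paper's implementation: the window radius, the use of cosine similarity, and the fixed threshold value are all illustrative assumptions, and the function name point_to_mask is hypothetical.

import numpy as np

def point_to_mask(features, p, radius=4, threshold=0.85):
    """Turn a point prompt into a binary mask from a dense feature map.

    features: (H, W, C) per-pixel features (e.g. the model's predicted instance coloring).
    p: (row, col) prompt point.
    radius, threshold: illustrative hyperparameters, not the paper's values.
    """
    H, W, C = features.shape
    r, c = p

    # 1) Average the features in a small window around the prompt point
    #    to form the query feature q_p.
    r0, r1 = max(0, r - radius), min(H, r + radius + 1)
    c0, c1 = max(0, c - radius), min(W, c + radius + 1)
    q_p = features[r0:r1, c0:c1].reshape(-1, C).mean(axis=0)

    # 2) Similarity between q_p and every pixel's feature
    #    (cosine similarity here; the paper specifies the exact measure).
    flat = features.reshape(-1, C)
    sim = flat @ q_p / (np.linalg.norm(flat, axis=1) * np.linalg.norm(q_p) + 1e-8)
    sim = sim.reshape(H, W)

    # 3) Threshold the similarity map to get a binary mask.
    return sim >= threshold

In use, features would be the model's output resized to the image resolution, and the returned boolean array is the mask for the object under the prompt point.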
Refer to the PDF paper linked above for more details on training and quantitative results.
@misc{khangaonkar2025gen2seg,
  title={gen2seg: Generative Models Enable Generalizable Instance Segmentation},
  author={Om Khangaonkar and Hamed Pirsiavash},
  year={2025},
  eprint={2505.15263},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.15263},
}