SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation

LIVIA, ÉTS Montréal, Canada
International Laboratory on Learning Systems (ILLS)
McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada

SPARK vs DiffCut. Qualitative comparison of segmentation maps produced by DiffCut (top row) and SPARK (bottom row) on a variety of outdoor and indoor scenes. SPARK produces more spatially coherent and fine-grained segments, particularly along object boundaries.

Abstract

We argue that existing training-free segmentation methods rely on an implicit and limiting assumption: that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks: they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs. Segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from Stable Diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.
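The adaptive pruning step can be illustrated with a small sketch. The exact pruning criterion is not given on this page, so the rule below (keep a transition only if it is at least a fraction `ratio` of the strongest transition in its row, then renormalize) is an assumption chosen for illustration, as are the names `prune_transitions` and `ratio`.

```python
import numpy as np

def prune_transitions(p, ratio=0.5):
    """Hypothetical adaptive pruning of a row-stochastic transition matrix.

    Assumption: an entry is 'unreliable' when it falls below `ratio` times
    the largest entry in its row. This is an illustrative rule, not the
    criterion used by SPARK.
    """
    p = np.asarray(p, dtype=float)
    # The threshold adapts per row: peaked rows prune more than flat rows.
    thresh = ratio * p.max(axis=1, keepdims=True)
    pruned = np.where(p >= thresh, p, 0.0)
    # Each row keeps at least its maximum, so for a row-stochastic input
    # the row sums are strictly positive and renormalization is safe.
    return pruned / pruned.sum(axis=1, keepdims=True)
```

Under this rule, a peaked row such as `[0.5, 0.3, 0.15, 0.05]` loses its two weakest transitions and the remaining mass is redistributed to the confident ones, whereas a uniform row is left unchanged.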


Overview of our training-free unsupervised segmentation pipeline. From an input image, we extract self-attention maps using a frozen Diffusion U-Net encoder to form a global affinity S_global and a sparse local affinity S_local. These are normalized and fused into a unified affinity S, which defines a Markov random walk over the pixel graph. Iterative Markov-flow propagation drives the system toward stable flow-preserving clusters, whose stationary distribution is used for label assignment, yielding the final segmentation.
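The pipeline in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the local affinity is taken to be the global affinity restricted to a 4-neighbour grid mask, the fusion weight `alpha` is a plain convex combination, and the propagation uses a Markov-clustering-style expansion and inflation loop as a stand-in for the Markov-flow propagation. All function and parameter names are hypothetical.

```python
import numpy as np

def row_normalize(m, eps=1e-12):
    """Scale each row to sum to (approximately) 1."""
    return m / (m.sum(axis=1, keepdims=True) + eps)

def grid_neighbor_mask(h, w):
    """Boolean (h*w, h*w) mask: each node, itself, and its 4-neighbours."""
    idx = np.arange(h * w).reshape(h, w)
    mask = np.eye(h * w, dtype=bool)
    mask[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = True   # right neighbour
    mask[idx[:, 1:].ravel(), idx[:, :-1].ravel()] = True   # left neighbour
    mask[idx[:-1, :].ravel(), idx[1:, :].ravel()] = True   # lower neighbour
    mask[idx[1:, :].ravel(), idx[:-1, :].ravel()] = True   # upper neighbour
    return mask

def fuse_affinities(s_global, h, w, alpha=0.5):
    """Assumed fusion: convex mix of normalized global and local affinities."""
    s_local = np.where(grid_neighbor_mask(h, w), s_global, 0.0)  # sparse
    fused = alpha * row_normalize(s_global) + (1 - alpha) * row_normalize(s_local)
    return row_normalize(fused)  # row-stochastic transition matrix

def markov_flow_labels(p, inflation=2.0, n_iter=50, tol=1e-6, support_thr=1e-4):
    """Stand-in propagation: expansion (two-step walk) + inflation (sharpen)."""
    n = p.shape[0]
    for _ in range(n_iter):
        nxt = row_normalize((p @ p) ** inflation)
        converged = np.abs(nxt - p).max() < tol
        p = nxt
        if converged:
            break
    # Nodes that still exchange flow belong together: take connected
    # components of the surviving transitions via min-label propagation.
    adj = p > support_thr
    adj = adj | adj.T | np.eye(n, dtype=bool)
    labels = np.arange(n)
    while True:
        new = np.where(adj, labels[None, :], n).min(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return np.unique(labels, return_inverse=True)[1]  # compact ids 0..K-1
```

Given an `(h*w, h*w)` non-negative affinity matrix `s`, calling `markov_flow_labels(fuse_affinities(s, h, w))` returns one integer label per node. In this sketch the number of segments is not fixed in advance; it falls out of how many groups of nodes keep exchanging flow after propagation.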

Unsupervised segmentation results across six benchmarks. We report zero-shot mIoU using diffusion features. Best results are shown in bold, second-best are underlined, and differences from the second-best method are indicated in green. SPARK consistently outperforms prior training-free approaches, including DiffCut and Seg4Diff, on all six datasets.

Model                  VOC           Context       COCO-Object   COCO-Stuff-27   Cityscapes    ADE20K
ReCO (NeurIPS'22)      25.1          19.9          15.7          26.3            19.3          11.2
MaskCLIP (ICML'23)     38.8          23.6          20.6          19.6            10.0          9.8
MaskCut (CVPR'23)      53.8          43.4          30.1          41.7            18.7          35.7
iSeg (arXiv'24)        ×             ×             ×             45.2            25.0          ×
DiffSeg (CVPR'24)      49.8          48.8          23.2          44.2            16.8          37.7
DiffCut (NeurIPS'24)   65.2          56.5          34.1          49.1            30.6          44.3
Seg4Diff (NeurIPS'25)  54.9          52.6          38.5          49.7            24.2          44.9
SPARK                  66.9 (+1.7)   57.7 (+1.2)   42.7 (+4.2)   52.0 (+2.3)     33.9 (+3.5)   48.0 (+3.1)
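For readers unfamiliar with the metric, the sketch below shows how mIoU can be computed from a confusion matrix. It is an illustration only and makes two assumptions not stated on this page: predicted segment ids have already been matched to ground-truth class ids (unsupervised methods typically need such a matching step, for example Hungarian matching), and pixels labelled `ignore_index` are excluded. The function name and parameters are hypothetical.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in pred or gt."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    valid = gt != ignore_index  # drop unlabelled pixels
    # Confusion matrix: rows are ground-truth classes, columns predictions.
    conf = np.bincount(
        num_classes * gt[valid] + pred[valid],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0  # skip classes absent from both pred and gt
    return float((inter[present] / union[present]).mean())
```

The values in the table are percentages, so a result from this function would be multiplied by 100 before comparison.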

BibTeX

@article{mahatha2025nerve,
  title={NERVE: Neighbourhood \& Entropy-guided Random-walk for training free open-Vocabulary sEgmentation},
  author={Mahatha, Kunal and Dolz, Jose and Desrosiers, Christian},
  journal={arXiv preprint arXiv:2511.08248},
  year={2025}
}