NERVE: Neighbourhood & Entropy-guided Random-walk for training-free open-Vocabulary sEgmentation

LIVIA, ÉTS Montréal, Canada
International Laboratory on Learning Systems (ILLS)
McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada
Teaser Image

Progressive segmentation refinement via Stochastic Random Walk. For each image, we visualize the effect of increasing random walk steps on the segmentation map. Regions of interest are highlighted with colored boxes. Our method produces increasingly accurate and coherent segmentations as the number of steps grows, illustrating strong local–global propagation.

Abstract

Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training-free OVSS, termed NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information by exploiting the neighbourhood structure from the self-attention layer of a Stable Diffusion model. We also introduce a stochastic random walk to refine the affinity, rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques such as Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments on 7 popular semantic segmentation benchmarks yield overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
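To illustrate the entropy-guided selection of attention maps, the minimal PyTorch sketch below weights each attention head by the (negative) entropy of its attention rows, so that more confident heads contribute more to the fused affinity. The tensor shapes and the temperature tau are illustrative assumptions, not the exact formulation used by NERVE.

import torch

def entropy_weighted_fusion(attn: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # attn: (H, N, N) row-stochastic attention maps, one per head (assumed shape).
    # Returns a single (N, N) fused affinity map.
    eps = 1e-8
    # Shannon entropy of each attention row, averaged per head -> (H,)
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (H, N)
    head_entropy = row_entropy.mean(dim=-1)                 # (H,)
    # Lower entropy (more peaked, i.e. more confident) -> larger weight.
    weights = torch.softmax(-head_entropy / tau, dim=0)     # (H,)
    # Weighted sum of the per-head maps.
    return (weights[:, None, None] * attn).sum(dim=0)       # (N, N)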

Method Overview Figure

Overview of our training-free open-vocabulary segmentation pipeline. Given an image and a set of text prompts, we extract cross-attention maps using CLIP and self-attention maps from a Stable Diffusion encoder. We compute entropy-guided fusion across attention heads h to obtain global (A_global) and local (A_local) affinities. These are normalized and linearly combined into a final stochastic matrix S, which is used in a truncated stochastic random walk to propagate semantic information and generate refined segmentation masks.
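For concreteness, the following minimal PyTorch sketch shows one way the normalized global and local affinities could be blended into a row-stochastic matrix S and used in a truncated random walk to propagate per-class scores. The mixing weight alpha, the number of steps, and the tensor shapes are assumptions for illustration, not the exact implementation.

import torch

def truncated_random_walk(scores, a_global, a_local, alpha=0.5, num_steps=8):
    # scores:   (N, C) initial class scores for N patches and C text prompts.
    # a_global: (N, N) global affinity (e.g. entropy-fused attention).
    # a_local:  (N, N) local affinity from the Stable Diffusion self-attention.
    # alpha, num_steps: illustrative hyper-parameters (assumptions).
    # Linearly combine the affinities and row-normalize into a stochastic matrix S.
    affinity = alpha * a_global + (1.0 - alpha) * a_local
    S = affinity / affinity.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    # Truncated stochastic random walk: propagate scores along S for a few steps.
    refined = scores
    for _ in range(num_steps):
        refined = S @ refined
    return refined

# Usage sketch: the refined scores are converted to a segmentation map
# by taking the argmax over the class dimension.
# seg = truncated_random_walk(scores, a_global, a_local).argmax(dim=-1)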

Quantitative evaluation across five datasets and two of their variants. The first three benchmarks (V21, PC60, and C-Obj) include a background category, whereas the subsequent ones do not. The Post. column indicates whether an approach employs post-processing for mask refinement.

Method Post. V21 PC60 C-Obj V20 ADE PC59 C-Stf Avg
CLIP ✗ 18.6 7.8 6.5 49.1 3.2 11.2 5.7 13.6
MaskCLIP ✗ 43.4 23.2 20.6 74.9 16.7 26.4 16.7 30.3
GroupViT ✗ 52.3 18.7 27.5 79.7 15.3 23.4 15.3 30.7
CLIP-DIY ✗ 59.0 - 30.4 - - - - -
GEM ✗ 46.2 - - - 15.7 32.6 - -
SCLIP ✗ 59.1 30.4 30.5 80.4 16.1 34.2 22.4 38.2
PixelCLIP ✗ - - - 85.9 20.3 37.9 23.6 -
NACLIP ✗ 58.9 32.2 33.2 79.9 17.4 35.2 23.3 39.4
iSeg ✗ 68.2 30.9 38.4 - - - - -
NERVE ✗ 69.7 (+1.5) 37.7 (+5.5) 43.3 (+4.9) 90.1 (+4.2) 24.0 (+3.7) 43.4 (+5.5) 28.8 (+5.2) 48.1 (+4.3)

ReCo ✓ 25.1 19.9 15.7 57.7 11.2 22.3 14.8 23.8
SCLIP ✓ 61.7 31.5 32.1 83.5 17.8 36.1 23.9 40.1
ClearCLIP ✓ 46.1 26.7 30.1 80.0 15.0 29.6 19.9 35.3
ProxyCLIP ✓ 60.6 34.5 39.2 83.2 22.6 37.7 25.6 43.3
OVDiff ✓ - - - 80.9 14.1 32.2 20.3 -
CaR ✓ 67.6 30.5 36.6 91.4 17.7 39.5 - -
NACLIP ✓ 64.1 35.0 36.2 83.0 19.1 38.4 25.7 42.5
NERVE ✗ 69.7 (+2.1) 37.7 (+2.7) 43.3 (+4.1) 90.1 (-1.3) 24.0 (+1.4) 43.4 (+3.9) 28.8 (+3.1) 48.1 (+2.3)

BibTeX

@article{mahatha2025nerve,
      title={NERVE: Neighbourhood \& Entropy-guided Random-walk for training free open-Vocabulary sEgmentation},
      author={Mahatha, Kunal and Dolz, Jose and Desrosiers, Christian},
      journal={arXiv preprint arXiv:2511.08248},
      year={2025}
    }