Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity-refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighbourhoods. We propose a strong baseline for training-free OVSS, termed NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information by exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk that refines the affinity, rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected, semantically related regions, enabling the method to effectively delineate objects of arbitrary shape. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing technique such as Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments on 7 popular semantic segmentation benchmarks yield overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
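The two core ideas above — weighting attention maps by their entropy-based uncertainty, and refining affinities with a stochastic random walk instead of a fixed Gaussian kernel — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the softmax-over-negative-entropy weighting, and the damped diffusion with `steps`/`alpha` are our assumptions for exposition.

```python
import numpy as np


def entropy_weighted_fusion(attn_maps):
    """Fuse per-head attention maps, down-weighting high-entropy (uncertain) heads.

    attn_maps: (H, N, N) non-negative attention from H heads over N patches.
    Illustrative sketch; the weighting scheme here is an assumption.
    """
    eps = 1e-8
    p = attn_maps / (attn_maps.sum(-1, keepdims=True) + eps)      # row-normalize
    ent = -(p * np.log(p + eps)).sum(-1).mean(-1)                  # (H,) mean row entropy
    w = np.exp(-ent)
    w /= w.sum()                                                   # confident heads weigh more
    return np.einsum('h,hij->ij', w, attn_maps)                    # (N, N) fused affinity


def random_walk_refine(affinity, logits, steps=3, alpha=0.8):
    """Diffuse per-patch class scores along the affinity graph.

    affinity: (N, N) non-negative; logits: (N, C) per-patch class scores.
    A damped random walk: scores propagate to connected, semantically
    related patches instead of within a fixed isotropic neighbourhood.
    """
    T = affinity / (affinity.sum(-1, keepdims=True) + 1e-8)        # row-stochastic transitions
    out = logits
    for _ in range(steps):
        out = alpha * T @ out + (1 - alpha) * logits               # diffuse, keep source signal
    return out
```

Because the walk follows the affinity graph rather than a fixed kernel, mass spreads along whatever connected region the attention structure describes, which is what lets it follow objects of arbitrary shape.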
Zero-shot segmentation results (mIoU, %) on seven benchmarks. "Post-proc." indicates whether the method uses post-processing (e.g., CRF or PAMR); NERVE uses none in either comparison. Parenthesized deltas are relative to the best prior method in each group.

| Method | Post-proc. | V21 | PC60 | C-Obj | V20 | ADE | PC59 | C-Stf | Avg |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | ✗ | 18.6 | 7.8 | 6.5 | 49.1 | 3.2 | 11.2 | 5.7 | 13.6 |
| MaskCLIP | ✗ | 43.4 | 23.2 | 20.6 | 74.9 | 16.7 | 26.4 | 16.7 | 30.3 |
| GroupViT | ✗ | 52.3 | 18.7 | 27.5 | 79.7 | 15.3 | 23.4 | 15.3 | 30.7 |
| CLIP-DIY | ✗ | 59.0 | - | 30.4 | - | - | - | - | - |
| GEM | ✗ | 46.2 | - | - | - | 15.7 | 32.6 | - | - |
| SCLIP | ✗ | 59.1 | 30.4 | 30.5 | 80.4 | 16.1 | 34.2 | 22.4 | 38.2 |
| PixelCLIP | ✗ | - | - | - | 85.9 | 20.3 | 37.9 | 23.6 | - |
| NACLIP | ✗ | 58.9 | 32.2 | 33.2 | 79.9 | 17.4 | 35.2 | 23.3 | 39.4 |
| iSeg | ✗ | 68.2 | 30.9 | 38.4 | - | - | - | - | - |
| NERVE | ✗ | 69.7 (+1.5) | 37.7 (+5.5) | 43.3 (+4.9) | 90.1 (+4.2) | 24.0 (+3.7) | 43.4 (+5.5) | 28.8 (+5.2) | 48.1 (+4.3) |
| ReCo | ✔ | 25.1 | 19.9 | 15.7 | 57.7 | 11.2 | 22.3 | 14.8 | 23.8 |
| SCLIP | ✔ | 61.7 | 31.5 | 32.1 | 83.5 | 17.8 | 36.1 | 23.9 | 40.1 |
| ClearCLIP | ✔ | 46.1 | 26.7 | 30.1 | 80.0 | 15.0 | 29.6 | 19.9 | 35.3 |
| ProxyCLIP | ✔ | 60.6 | 34.5 | 39.2 | 83.2 | 22.6 | 37.7 | 25.6 | 43.3 |
| OVDiff | ✔ | - | - | - | 80.9 | 14.1 | 32.2 | 20.3 | - |
| CaR | ✔ | 67.6 | 30.5 | 36.6 | 91.4 | 17.7 | 39.5 | - | - |
| NACLIP | ✔ | 64.1 | 35.0 | 36.2 | 83.0 | 19.1 | 38.4 | 25.7 | 42.5 |
| NERVE | ✗ | 69.7 (+2.1) | 37.7 (+2.7) | 43.3 (+4.1) | 90.1 (-1.3) | 24.0 (+1.4) | 43.4 (+3.9) | 28.8 (+3.1) | 48.1 (+2.3) |
@article{mahatha2025nerve,
title={NERVE: Neighbourhood \& Entropy-guided Random-walk for training-free open-Vocabulary sEgmentation},
author={Mahatha, Kunal and Dolz, Jose and Desrosiers, Christian},
journal={arXiv preprint arXiv:2511.08248},
year={2025}
}