OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

Kunal Mahatha1, Ali Bahri1, Pierre Marza2, Sahar Dastani1, Maria Vakalopoulou2, Stergios Christodoulidis2, Jose Dolz1, Christian Desrosiers1
1ILLS, LIVIA, ÉTS Montréal, Canada 2MICS, CentraleSupélec, Université Paris-Saclay
Teaser Image

Qualitative comparison of segmentation results. Octopus produces more complete and spatially consistent predictions than VMamba across diverse scenes. In the first example, Octopus correctly segments the road and yields a cleaner mask for the van. In the indoor scene, Octopus additionally identifies the ceiling light and the side door, which VMamba fails to detect. Overall, Octopus demonstrates stronger boundary preservation and improved object completeness.

Abstract

State space models (SSMs) have recently emerged as an alternative to transformers due to their ability to model global relationships in text with linear complexity. However, their success in vision tasks has been limited by their causal formulation, which suits sequential text but is detrimental in the spatial domain, where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images while maintaining the linear complexity of SSMs. OCTOPUS performs discrete recurrence along eight principal orientations, traversing forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This multi-directional recurrence captures both global context and local spatial structure with SSM-level efficiency. On our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as the segmentation results show, while achieving classification accuracy on par with or better than existing vision-SSM models. These results suggest multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.
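To make the eight-orientation traversal concrete, the sketch below (our own minimal Python/NumPy illustration, not the authors' code) enumerates eight scan orders over an H × W patch grid: row-major, column-major, and the two diagonal traversals, each taken forward and backward. The function name octopus_scan_orders and the exact diagonal conventions are assumptions.

import numpy as np

def octopus_scan_orders(H, W):
    """Hypothetical sketch: return eight 1-D traversal orders (flat index
    arrays) over an H x W patch grid, covering forward and backward passes
    along the horizontal, vertical, and both diagonal directions."""
    idx = np.arange(H * W).reshape(H, W)
    horizontal = idx.reshape(-1)   # row-major: left-to-right, top-to-bottom
    vertical = idx.T.reshape(-1)   # column-major: top-to-bottom, left-to-right
    # Main-diagonal order: concatenate the diagonals j - i = k, each walked
    # top-left to bottom-right, so diagonal neighbors stay adjacent in sequence.
    diag = np.concatenate([np.diagonal(idx, k) for k in range(-(H - 1), W)])
    # Anti-diagonal order: the same walk on the horizontally flipped grid.
    anti = np.concatenate([np.diagonal(idx[:, ::-1], k) for k in range(-(H - 1), W)])
    forward = [horizontal, vertical, diag, anti]
    return forward + [o[::-1] for o in forward]   # add the four reversed passes

orders = octopus_scan_orders(4, 4)
assert len(orders) == 8 and all(len(o) == 16 for o in orders)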

Architecture Figure

Overall Architecture of Octopus: (a) The full hierarchical encoder consisting of four stages. Each stage applies multiple O-VSS blocks, with spatial resolution preserved inside the stage and reduced via downsampling between stages. (b) The O-VSS block design, composed of Layer Normalization, an O-SS2D block (which integrates the proposed 8-direction Selective Scan), and an FFN. The O-SS2D block itself combines O-Scan operations with an O-Merge module to fuse multi-directional features.
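As a code-level companion to the figure, the following PyTorch sketch wires up the block layout described in the caption (LayerNorm, then O-SS2D with a residual connection, then LayerNorm and FFN with a second residual). It reuses octopus_scan_orders from the sketch above and is only an assumed skeleton: the real selective scan (S6) is replaced by a depthwise causal convolution as a stand-in, and O-Merge is modeled as a simple linear fusion.

import torch
import torch.nn as nn

class OSS2D(nn.Module):
    """Placeholder O-SS2D: reorder tokens along each of the eight scan paths,
    run a stand-in sequence mixer per direction, undo the reordering, and fuse
    the eight outputs with a linear layer standing in for O-Merge."""
    def __init__(self, dim):
        super().__init__()
        # Stand-in for the selective scan: a shared depthwise causal conv.
        self.mixer = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.merge = nn.Linear(8 * dim, dim)    # assumed linear O-Merge fusion

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        seq = x.reshape(B, H * W, C)
        outs = []
        for order in octopus_scan_orders(H, W): # from the earlier sketch
            order = torch.as_tensor(order.copy(), device=x.device)
            inv = torch.argsort(order)          # inverse permutation
            y = seq[:, order].transpose(1, 2)   # (B, C, L) for the conv
            y = self.mixer(y)[..., : H * W].transpose(1, 2)  # causal: keep first L
            outs.append(y[:, inv])              # restore spatial order
        return self.merge(torch.cat(outs, dim=-1)).reshape(B, H, W, C)

class OVSSBlock(nn.Module):
    """Block layout from the caption: LN -> O-SS2D -> residual, LN -> FFN -> residual."""
    def __init__(self, dim, ffn_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.oss2d = OSS2D(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_ratio * dim), nn.GELU(),
                                 nn.Linear(ffn_ratio * dim, dim))

    def forward(self, x):                       # x: (B, H, W, C)
        x = x + self.oss2d(self.norm1(x))
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 8, 8, 32)
print(OVSSBlock(32)(x).shape)                   # torch.Size([2, 8, 8, 32])

In the full hierarchical encoder of panel (a), several such blocks would be stacked per stage, with spatial downsampling applied between stages.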

ERF evolution across models
Comparison of effective receptive fields (ERFs) before and after training across different architectures. Ours shows the most structured and spatially aware ERF pattern.
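For reference, ERF maps of this kind are commonly measured by backpropagating a unit gradient from the center output location to the input (Luo et al., 2016). The sketch below shows that standard procedure, assuming a model that returns a (B, C, H', W') feature map; it is not the authors' evaluation code.

import torch

def effective_receptive_field(model, images):
    """Average absolute input gradient induced by a unit gradient at the
    center spatial position of the output feature map."""
    images = images.clone().requires_grad_(True)
    feats = model(images)                       # assumed shape: (B, C, H', W')
    h, w = feats.shape[-2] // 2, feats.shape[-1] // 2
    feats[:, :, h, w].sum().backward()          # seed gradient at the center unit
    return images.grad.abs().mean(dim=(0, 1))   # (H, W) ERF heatmap

Running this on an untrained and a trained checkpoint yields the before/after comparison shown in the figure.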

Classification performance on miniImageNet. Input images are of size 224 × 224. T, S, B, M, and N denote the tiny, small, base, micro, and nano scales for each method. Gains in parentheses are relative to the VMamba model of the same scale.

Model         FLOPs (G)   Top-1 (%)       Top-5 (%)
Transformer-Based
DeiT-S        4.6         70.83           89.74
DeiT-B        17.5        72.43           90.14
Swin-T        4.5         83.25           95.54
Swin-S        8.7         84.10           95.53
Swin-B        15.4        82.77           95.33
XCiT-S24      9.2         85.79           96.31
XCiT-M24      16.2        86.80           96.38
SSM-Based
Vim-T         1.5         67.30           87.97
Vim-S         5.1         79.70           93.20
LocalVim-T    1.5         82.12           94.60
LocalVim-S    4.8         81.68           93.63
MSVMamba-N    0.9         82.16           95.10
MSVMamba-M    1.5         83.72           95.81
MSVMamba-T    4.6         86.48           96.43
VMamba-T      4.9         85.82           96.37
VMamba-S      8.7         86.48           96.79
Ours-T        6.1         86.60 (+0.78)   96.63 (+0.26)
Ours-S        10.0        86.89 (+0.41)   96.81 (+0.02)

Results of semantic segmentation on ADE20K. SS and MS denote single-scale and multi-scale testing, respectively. Gains in parentheses are relative to the VMamba model of the same scale.

Method             mIoU (SS)        mIoU (MS)
VMamba-T           22.77            23.98
VMamba-S           25.84            27.13
VMamba-B           26.32            28.41
SpectralVMamba-T   25.58            27.72
SpectralVMamba-S   27.52            29.34
SpectralVMamba-B   27.81            28.99
Ours-T             37.93 (+15.16)   37.98 (+14.00)
Ours-S             39.04 (+13.20)   39.32 (+12.19)
Ours-B             40.32 (+14.00)   40.95 (+12.54)

BibTeX

@misc{mahatha2026octopusenhancingspatialawarenessvision,
      title={OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection}, 
      author={Kunal Mahatha and Ali Bahri and Pierre Marza and Sahar Dastani and Maria Vakalopoulou and Stergios Christodoulidis and Jose Dolz and Christian Desrosiers},
      year={2026},
      eprint={2602.00904},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.00904}, 
}