
LocalViT: Analyzing Locality in Vision Transformers

Yawei Li, K. Zhang, 2 additional authors, L. Van Gool

2021 · DOI: 10.1109/IROS55552.2023.10342025
IEEE/RSJ International Conference on Intelligent Robots and Systems · 422 Citations

TLDR

The locality mechanism is systematically investigated through carefully designed controlled experiments, and the same mechanism is successfully applied to vision transformers with different architecture designs, demonstrating the generality of the locality concept.

Abstract

The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated by carefully designed controlled experiments. We add locality to vision transformers within the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and proper choices can lead to a performance gain over the baseline, and 2) The same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generalization of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T [1], DeiT-T [2] and PVT-T [3] by 1.0%, 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
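The comparison drawn in the abstract between feed-forward networks and inverted residual blocks suggests a concrete recipe: expand the token embeddings with a point-wise layer, reshape the token sequence back onto its 2D grid, apply a depth-wise convolution for local information exchange, then project back down. The following NumPy sketch illustrates that idea under assumptions of ours, not the paper's exact implementation: `local_ffn`, `depthwise_conv3x3`, and all dimensions are hypothetical names, ReLU stands in for whatever activation the paper's design study selects, and handling of the class token is omitted.

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Zero-padded depth-wise 3x3 convolution.
    x: (C, H, W) feature map; w: (C, 3, 3) one kernel per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # pad spatial dims only
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # each channel is convolved with its own kernel (no cross-channel mixing)
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def local_ffn(tokens, w1, w_dw, w2, h, w):
    """Locality-enhanced feed-forward block (sketch).
    tokens: (N, d) with N = h * w patch embeddings."""
    # 1) point-wise expansion: the usual first linear layer of an FFN
    x = np.maximum(tokens @ w1, 0.0)      # (N, d * r)
    # 2) reshape the sequence onto the 2D grid and exchange local information
    x = x.T.reshape(-1, h, w)             # (d * r, h, w)
    x = np.maximum(depthwise_conv3x3(x, w_dw), 0.0)
    # 3) flatten back to a sequence and project: the second linear layer
    x = x.reshape(x.shape[0], -1).T       # (N, d * r)
    return x @ w2                         # (N, d)

# toy demo: 4x4 grid of 8-dim tokens, expansion ratio r = 4
rng = np.random.default_rng(0)
h_, w_, d, r = 4, 4, 8, 4
tokens = rng.standard_normal((h_ * w_, d))
out = local_ffn(tokens,
                rng.standard_normal((d, d * r)),
                rng.standard_normal((d * r, 3, 3)),
                rng.standard_normal((d * r, d)),
                h_, w_)
```

The depth-wise convolution adds only `C * 9` weights per block (here 32 * 9 = 288), which matches the abstract's claim that the locality mechanism costs a negligible number of extra parameters relative to the `d * d * r` weights of each point-wise layer.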
