Res-ViT: Residual Vision Transformers for Image Recognition Tasks.

Sayda Elmi, Morris Bell

2023 · DOI: 10.1109/ICTAI59109.2023.00052
IEEE International Conference on Tools with Artificial Intelligence · 6 Citations

TLDR

Res-ViT (Residual Vision Transformer) is a new architecture that improves the Vision Transformer (ViT) in performance and efficiency by introducing residual networks into ViT, combining the strengths of both designs. Experiments show that Res-ViT is a promising approach for image classification, offering improved performance, efficiency, and robustness.

Abstract

Transformers have recently dominated a wide range of tasks in natural language processing. The Vision Transformer (ViT) was the first computer vision model to rely exclusively on the Transformer architecture to obtain competitive image classification performance. Despite the success of Vision Transformers at large scale, their performance still falls below that of similarly sized convolutional neural network (CNN) counterparts (e.g., ResNets). We present in this paper a new architecture, named Residual Vision Transformer (Res-ViT), that improves ViT in performance and efficiency by introducing residual networks into ViT to yield the best of both designs. The classic ViT architecture is modified mainly by: (i) a hierarchy of Transformers containing a new residual token embedding, and (ii) a residual Transformer block leveraging a residual projection. Moreover, the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in the Res-ViT model, simplifying the design for higher-resolution vision tasks. We validate Res-ViT through extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k. In addition, performance gains are maintained when the model is pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned on downstream tasks. Pretrained on ImageNet-22k, our Res-ViT obtains a top-1 accuracy on ImageNet-1k that proves the Res-ViT model is a promising approach for image classification, offering improved performance, efficiency, and robustness.
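The residual-projection idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the ReLU MLP standing in for a Transformer sub-block, and all weight names are assumptions for illustration. The key point it shows is ResNet-style shortcut handling: the shortcut is the identity when input and output dimensions match, and a learned linear projection otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over its feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # ReLU MLP standing in for a Transformer sub-block (hypothetical stand-in).
    return np.maximum(x @ w1, 0.0) @ w2

def residual_block(x, w1, w2, w_proj=None):
    # Residual connection: identity shortcut when dimensions match,
    # learned linear projection when the block changes the feature width.
    shortcut = x if w_proj is None else x @ w_proj
    return shortcut + mlp(layer_norm(x), w1, w2)

# Toy dimensions (assumptions, not from the paper).
d_in, d_hidden, d_out = 64, 128, 96
x = rng.standard_normal((10, d_in))            # 10 tokens of width 64
w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
w2 = rng.standard_normal((d_hidden, d_out)) * 0.02
w_proj = rng.standard_normal((d_in, d_out)) * 0.02

y = residual_block(x, w1, w2, w_proj)          # shape (10, 96)
```

A dimension-changing stage of a hierarchical Transformer would use the projected shortcut, while same-width blocks can pass `w_proj=None` and keep the identity shortcut.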