CBPT: A New Backbone for Enhancing Information Transmission of Vision Transformers
Wenxin Yu, Hongru Zhang, 2 Authors, Dong Yin
TLDR
The Locally-Enhanced Window Self-attention mechanism is developed to double the receptive field at a computational cost similar to typical WSA, and Information-Enhanced Patch Merging is proposed to address the information loss incurred when downsampling the attention map.
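For context on what Information-Enhanced Patch Merging improves upon, the sketch below shows the standard 2×2 patch merging used by Swin-style backbones, where each 2×2 neighborhood of tokens is concatenated along the channel axis. This is a minimal illustration of the baseline operation only, not the paper's proposed method; plain Python lists stand in for tensors.

```python
def patch_merge(feature_map):
    # Standard 2x2 patch merging (the baseline this paper improves on):
    # concatenate each 2x2 neighborhood of tokens along the channel axis,
    # halving height and width while quadrupling the channel count.
    # (A linear layer then typically reduces channels 4C -> 2C.)
    h, w = len(feature_map), len(feature_map[0])
    merged = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # List concatenation stands in for channel-wise concat.
            row.append(feature_map[i][j] + feature_map[i][j + 1]
                       + feature_map[i + 1][j] + feature_map[i + 1][j + 1])
        merged.append(row)
    return merged

# A 4x4 map of 1-channel tokens becomes a 2x2 map of 4-channel tokens.
tokens = [[[i * 4 + j] for j in range(4)] for i in range(4)]
out = patch_merge(tokens)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
```

Because three of every four spatial positions are absorbed into channels without any selection or weighting, naive merging can discard salient information — the motivation the TLDR cites for the information-enhanced variant.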
Abstract
This paper presents an efficient multi-scale vision transformer, called CBPT, that capably serves as a general-purpose backbone for computer vision. A challenging issue in transformer design is that window self-attention (WSA) often limits the information transmission of each token, whereas enlarging WSA's receptive field is computationally expensive. To address this issue, we develop the Locally-Enhanced Window Self-attention mechanism, which doubles the receptive field while keeping a computational complexity similar to that of typical WSA. In addition, we propose Information-Enhanced Patch Merging, which addresses the information loss incurred when downsampling the attention map. Incorporating these designs together with the Cross Block Partial connection, CBPT not only surpasses Swin by +1 box AP and +1 mask AP on COCO object detection and instance segmentation, but also has 30% fewer parameters and 35% fewer FLOPs than Swin.
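The abstract's claim that enlarging WSA's receptive field is expensive follows from the quadratic cost of attention within each window: doubling the window side quadruples the per-window token count, and attention cost grows with its square. The sketch below counts the dominant matmul FLOPs (QKᵀ and attention×V) to make that concrete. The resolution 56×56, window size 7, and embedding dimension 96 are assumed typical Swin-T stage-1 settings, not values taken from this paper.

```python
def wsa_flops(height, width, window, dim):
    # Dominant FLOPs of window self-attention: each token attends only
    # within its own (window x window) block. Per window, QK^T and
    # attn @ V each cost M^2 * M^2 * dim multiply-adds (M^2 tokens).
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * 2 * tokens_per_window ** 2 * dim

base = wsa_flops(56, 56, 7, 96)     # typical Swin-T window size
doubled = wsa_flops(56, 56, 14, 96)  # naive receptive-field doubling
print(doubled / base)  # 4.0
```

Naively doubling the window side thus quadruples the attention FLOPs, which is why the paper's Locally-Enhanced WSA — doubling the receptive field at near-baseline cost — is a meaningful design target.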
