Enhanced Vision Transformer with Dual-Dimensional Self-Attention for Image Recognition

Zhenxiong Chang, Qingyu Cai

2023 · DOI: 10.1109/PRAI59366.2023.10332027
2 Citations

TLDR

An improved model based on the Vision Transformer is presented that integrates additional self-attention mechanisms and one-dimensional convolutions to enhance the performance of the Vision Transformer block, surpassing both traditional Vision Transformer models and conventional convolutional neural networks when parameters and computational complexity are comparable.

Abstract

This paper presents an improved model based on the Vision Transformer that integrates additional self-attention mechanisms and one-dimensional convolutions to enhance the performance of the Vision Transformer block. The process begins by dividing the image into multiple patches and applying positional encoding. The attention mechanism is first computed for the hidden variables, followed by recalculating the attention mechanism for the patch dimension, and finally, mapping the output result using one-dimensional convolution. By incorporating this mechanism, we capture a greater degree of feature correlations, thereby enhancing the model’s expressive capabilities. Our approach yields significant improvements in image recognition performance, surpassing both traditional Vision Transformer models and conventional convolutional neural networks when parameters and computational complexity are comparable. Of particular note is its effectiveness on relatively small datasets, validating the feasibility and efficiency of our proposed method in enhancing image recognition tasks, making it a promising solution for practical applications across diverse domains.
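The pipeline described in the abstract — attention computed first over the hidden (channel) dimension, then over the patch dimension, followed by a one-dimensional convolution — can be sketched as below. This is a minimal single-head NumPy illustration of the data flow only, not the authors' implementation: the identity Q/K/V projections, the kernel size, and the kernel weights are all our assumptions for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product attention with identity
    # Q/K/V projections (an assumption for brevity): each row of x
    # attends to every other row.
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def dual_dimensional_block(x):
    """x: (num_patches, hidden_dim) patch embeddings, positions already encoded."""
    # 1) Attention over the hidden dimension: transpose so the hidden
    #    variables are the rows that attend to one another.
    h = self_attention(x.T).T
    # 2) Attention over the patch dimension (rows = patches).
    p = self_attention(h)
    # 3) Map the result with a 1-D convolution along the patch axis.
    #    Kernel size 3 and these weights are illustrative assumptions.
    kernel = np.array([0.25, 0.5, 0.25])
    out = np.stack(
        [np.convolve(p[:, c], kernel, mode="same") for c in range(p.shape[1])],
        axis=1,
    )
    return out

# Example: 16 patches, hidden dimension 8; the block preserves shape.
x = np.random.default_rng(0).normal(size=(16, 8))
y = dual_dimensional_block(x)
print(y.shape)  # (16, 8)
```

Attending along both axes is what lets the block mix information across feature channels as well as across spatial patches, which is the source of the extra feature correlations the abstract refers to.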
