Paper
Efficient vision transformer for dynamic embedding of multiscale features
16 August 2024
Mingrui Zhang, Ronggui Wang, Juan Yang, Lixia Xue
Proceedings Volume 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024); 132300G (2024) https://doi.org/10.1117/12.3035520
Event: Third International Conference on Machine Vision, Automatic Identification and Detection, 2024, Kunming, China
Abstract
Vision Transformer (ViT) demonstrates the potential of the transformer architecture in computer vision. However, the computational complexity of self-attention grows quadratically with the input sequence length, which limits the application of transformers to high-resolution images. To improve the overall performance of the Vision Transformer, this paper proposes MLVT, an efficient vision transformer with dynamic embedding of multi-scale features. MLVT adopts a pyramid architecture and replaces standard self-attention with linear self-attention. Because linear self-attention disperses attention scores and thereby ignores local correlations, a local attention enhancement module is proposed, in which a convolution operation that mimics self-attention computation supplements the local attention. To cope with the growing feature dimension in the pyramid architecture, the computational bottleneck of linear self-attention is shifted from sequence length to feature dimension, and a linear self-attention with a compressed feature dimension is proposed. In addition, since multi-scale inputs are crucial for processing image information, this paper proposes a flexible, learnable dynamic multi-scale feature embedding module that dynamically adjusts the fusion weights of features at different scales according to the input image. Extensive experiments on image classification and object detection tasks show that MLVT achieves competitive results while reducing computational cost.
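The compressed-dimension linear self-attention that the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the extra down-projection `wr`, the ELU+1 kernel feature map, and all variable names are assumptions chosen to show how the cost bottleneck moves from the sequence length N to a reduced feature dimension.

```python
import numpy as np

def linear_attention_compressed(x, wq, wk, wv, wr, eps=1e-6):
    """Hypothetical sketch of linear self-attention with a compressed
    feature dimension.

    x is (N, d). Queries and keys are projected down to d_r << d by wr,
    so the attention cost is O(N * d_r * d) instead of the O(N^2 * d)
    of softmax attention: the bottleneck is the (reduced) feature
    dimension, not the sequence length.
    """
    q = x @ wq @ wr            # (N, d_r) compressed queries
    k = x @ wk @ wr            # (N, d_r) compressed keys
    v = x @ wv                 # (N, d)   values keep the full dimension
    # ELU(t) + 1, a common positive kernel feature map for linear attention
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))
    q, k = phi(q), phi(k)
    kv = k.T @ v                                   # (d_r, d), aggregated once
    z = q @ k.sum(axis=0, keepdims=True).T + eps   # (N, 1) normalizer
    return (q @ kv) / z                            # (N, d)
```

Because `kv` is accumulated once over all positions, the per-query work depends only on `d_r` and `d`; with `d_r` held small, deeper pyramid stages with larger feature dimensions stay cheap.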
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Mingrui Zhang, Ronggui Wang, Juan Yang, and Lixia Xue "Efficient vision transformer for dynamic embedding of multiscale features", Proc. SPIE 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024), 132300G (16 August 2024); https://doi.org/10.1117/12.3035520
KEYWORDS
Transformers
Convolution
Feature extraction
Image classification
Visual process modeling
Windows
Visualization