Paper
Efficient vision transformer for dynamic embedding of multiscale features
16 August 2024
Mingrui Zhang, Ronggui Wang, Juan Yang, Lixia Xue
Proceedings Volume 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024); 132300G (2024) https://doi.org/10.1117/12.3035520
Event: Third International Conference on Machine Vision, Automatic Identification and Detection, 2024, Kunming, China
Abstract
Vision Transformer (ViT) demonstrates the potential of the transformer architecture in computer vision. However, the computational complexity of self-attention grows quadratically with the input sequence length, which limits the application of transformers to high-resolution images. To improve the overall performance of the Vision Transformer, this paper proposes MLVT, an efficient vision transformer with dynamic embedding of multi-scale features. MLVT adopts a pyramid architecture and replaces standard self-attention with linear self-attention. Because linear self-attention disperses attention scores and thereby ignores local correlations, a local attention enhancement module is proposed, in which a convolution operation that mimics self-attention computation supplements the local attention. To cope with the growing feature dimension in the pyramid architecture, the computational bottleneck of linear self-attention is shifted from sequence length to feature dimension, and a linear self-attention with a compressed feature dimension is proposed. In addition, since multi-scale inputs are crucial for processing image information, this paper proposes a flexible, learnable dynamic multi-scale feature embedding module that dynamically adjusts the fusion weights of features at different scales according to the input image. Extensive experiments on image classification and object detection tasks show that MLVT achieves competitive results while reducing computational cost.
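The compressed-dimension linear self-attention that the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the extra down-projection `wr`, the ELU+1 kernel feature map, and all variable names are assumptions chosen to show how the cost bottleneck moves from the sequence length N to a reduced feature dimension.

```python
import numpy as np

def linear_attention_compressed(x, wq, wk, wv, wr, eps=1e-6):
    """Hypothetical sketch of linear self-attention with a compressed
    feature dimension.

    x is (N, d). Queries and keys are projected down to d_r << d by wr,
    so the attention cost is O(N * d_r * d) instead of the O(N^2 * d)
    of softmax attention: the bottleneck is the (reduced) feature
    dimension, not the sequence length.
    """
    q = x @ wq @ wr            # (N, d_r) compressed queries
    k = x @ wk @ wr            # (N, d_r) compressed keys
    v = x @ wv                 # (N, d)   values keep the full dimension
    # ELU(t) + 1, a common positive kernel feature map for linear attention
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))
    q, k = phi(q), phi(k)
    kv = k.T @ v                                   # (d_r, d), aggregated once
    z = q @ k.sum(axis=0, keepdims=True).T + eps   # (N, 1) normalizer
    return (q @ kv) / z                            # (N, d)
```

Because `kv` is accumulated once over all positions, the per-query work depends only on `d_r` and `d`; with `d_r` held small, deeper pyramid stages with larger feature dimensions stay cheap.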
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Mingrui Zhang, Ronggui Wang, Juan Yang, and Lixia Xue "Efficient vision transformer for dynamic embedding of multiscale features", Proc. SPIE 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024), 132300G (16 August 2024); https://doi.org/10.1117/12.3035520
KEYWORDS
Transformers
Convolution
Feature extraction
Image classification
Visual process modeling
Windows
Visualization