Paper | 23 May 2023
Focal ViT: image transformer catches up with CNN on small datasets
Bin Chen, Xin Feng
Proceedings Volume 12645, International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023); 1264519 (2023) https://doi.org/10.1117/12.2681103
Event: International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023), 2023, Hangzhou, China
Abstract
Recent advances in transformers have brought renewed interest to computer vision tasks. However, on small datasets, transformers are hard to train and underperform convolutional neural networks. We make vision transformers as data-efficient as convolutional neural networks by introducing focal attention. Inspired by local attention networks, we constrain the self-attention of ViT to a multi-scale localized receptive field. We provide empirical evidence that properly constraining the receptive field reduces the amount of training data that vision transformers require. Our best model reaches 83.16% accuracy when trained from scratch on CIFAR-100, a significant improvement in data efficiency over previous transformers. We also perform analysis on ImageNet to show that our method does not lose accuracy on large datasets.
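The paper's implementation is not reproduced on this page; the sketch below is only a minimal, hypothetical illustration of the idea the abstract describes, namely restricting ViT self-attention to local windows at several scales. The function names, the 1-D windowing over the token sequence, the window sizes (2, 4, 8), and the single-head, projection-free attention are all illustrative assumptions, not the authors' method.

import torch


def local_attention_mask(n_tokens: int, window: int) -> torch.Tensor:
    # Boolean (n_tokens, n_tokens) mask: entry (i, j) is True when token j
    # lies within `window` positions of token i along the 1-D token sequence.
    # (Assumption: the real model would likely window over the 2-D patch grid.)
    idx = torch.arange(n_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window


def masked_self_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    # Single-head scaled dot-product attention with scores outside the local
    # window set to -inf, so softmax assigns them zero weight.
    # x: (batch, n_tokens, dim); uses x as query/key/value for brevity,
    # omitting the learned projections a real ViT block would apply.
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    mask = local_attention_mask(x.size(1), window).to(x.device)
    scores = scores.masked_fill(~mask, float("-inf"))  # block distant tokens
    return torch.softmax(scores, dim=-1) @ x


def multi_scale_local_attention(x: torch.Tensor, windows=(2, 4, 8)) -> torch.Tensor:
    # Average the outputs at several window sizes to approximate a
    # multi-scale localized receptive field (window sizes are illustrative).
    return torch.stack([masked_self_attention(x, w) for w in windows]).mean(dim=0)


tokens = torch.randn(1, 64, 32)            # 64 patch tokens, 32-dim embeddings
out = multi_scale_local_attention(tokens)
print(out.shape)                           # torch.Size([1, 64, 32])

The sketch only shows how a masked softmax yields a localized receptive field and how combining several window sizes makes it multi-scale; the paper's actual focal attention may differ in how windows are shaped and aggregated.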
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Bin Chen and Xin Feng "Focal ViT: image transformer catches up with CNN on small datasets", Proc. SPIE 12645, International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023), 1264519 (23 May 2023); https://doi.org/10.1117/12.2681103
KEYWORDS: Transformers, Visual process modeling, Data modeling, Computer vision technology, Yield improvement