
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

📅 Published: October 1, 2021 👤 Li Yuan, Yunpeng Chen, Tao Wang et al. 📖 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 📊 2,247 citations
AI-Generated Summary

Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. This paper introduces Tokens-to-Token ViT (T2T-ViT), a vision transformer trained from scratch on ImageNet. A T2T-ViT of comparable size to ResNet50 (21.5M parameters) achieves 83.3% top-1 accuracy on ImageNet at an image resolution of 384x384.


Key Findings
  • 1 The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification.
  • 2 However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet.
  • 3 The authors attribute this to two causes: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, which leads to low training sample efficiency; 2) the redundant attention backbone design of ViT limits feature richness under a fixed computation budget and a limited number of training samples.
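
The "simple tokenization" mentioned in the findings can be illustrated with a minimal sketch. It shows a plain ViT-style hard split of an image into non-overlapping patches, each flattened into one token. This is not the T2T-ViT method from the paper, and it omits the learned linear projection that normally follows. The function name `patchify`, the patch size of 16, and the 224x224 input are assumptions chosen for illustration.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patch tokens.

    Hypothetical helper for illustration; patch=16 is an assumed value.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch: (gh * gw, patch * patch * c)
    return grid.reshape(gh * gw, patch * patch * c)

# Assumed input size: a 224x224 RGB image gives a 14x14 grid of patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

Each token here only sees the pixels inside its own patch, so structure that crosses patch boundaries (an edge spanning two neighboring patches, for instance) is not captured at tokenization time. That is the limitation the third finding describes.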
Why It Matters

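The paper examines why a plain ViT underperforms CNNs when trained from scratch on a midsize dataset like ImageNet, and reports that a transformer of comparable size to ResNet50 can reach 83.3% top-1 accuracy in that setting at 384x384 resolution.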

This summary is based on publicly available metadata and the abstract. The full paper is available through OpenAlex.
Article Details
Source OpenAlex
Category 🤖 Artificial Intelligence
Published Oct 1, 2021
Journal 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI 10.1109/iccv48922.2021.00060
Citations 2,247
Authors Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi