Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

🤖 Plain-English Summary

Transformers, widely used for language modeling, have recently been applied to vision tasks, for example the Vision Transformer (ViT) for image classification. This paper proposes Tokens-to-Token ViT (T2T-ViT), which, at a size comparable to ResNet50 (21.5M parameters), reaches 83.3% top-1 accuracy on ImageNet at 384x384 image resolution.

🔑 Key Findings

The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification.
However, ViT underperforms CNNs when trained from scratch on a midsize dataset like ImageNet.
The authors attribute this to two causes: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, which leads to low training sample efficiency; 2) the redundant attention backbone design of ViT limits feature richness under a fixed computation budget and a limited number of training samples.
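
The simple tokenization described in the findings above can be pictured with a short sketch. It assumes a NumPy image in height x width x channel layout and a patch size of 16, a common ViT setting that this page does not state; the function name `patchify` is hypothetical. Each token is one flattened, non-overlapping patch, so pixels on opposite sides of a patch border never share a token.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_tokens, patch_size * patch_size * C).
    Hypothetical helper for illustration, not code from the paper.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    # Factor each spatial axis into (grid index, offset inside the patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, channels)
    # Bring the two grid indices to the front so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: a fixed-length token.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * channels)


# Assumed input: a blank 384x384 RGB image, matching the resolution quoted above.
image = np.zeros((384, 384, 3), dtype=np.float32)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (576, 768)
```

Under these assumptions a 384x384 image becomes 24 x 24 = 576 tokens of 16 x 16 x 3 = 768 values each, and that sequence is what the Transformer layers would then process.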

💡 Why This Matters

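The reported result suggests that a vision transformer of roughly ResNet50 size can reach competitive ImageNet accuracy when trained from scratch, provided its tokenization captures local image structure and its backbone avoids redundancy.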

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Oct 01, 2021
Journal	2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Authors	Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi
DOI	10.1109/iccv48922.2021.00060
Citations	2,247
Source	OpenAlex
