🤖 Artificial Intelligence OpenAlex

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

📅 October 1, 2021 👤 Li Yuan, Yunpeng Chen, Tao Wang et al. 📖 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 📊 2,247 citations

🤖 Plain-English Summary

Transformers, popular for language modeling, have recently been applied to vision tasks, for example the Vision Transformer (ViT) for image classification. This paper's T2T-ViT, at a size comparable to ResNet50 (21.5M parameters), reaches 83.3% top-1 accuracy on ImageNet at 384x384 image resolution.

🔑 Key Findings

  • The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification.
  • However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet.
  • The authors attribute this to two causes: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, which lowers training sample efficiency; 2) the redundant attention backbone design of ViT limits feature richness under a fixed computation budget and limited training samples.
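
The fixed-grid tokenization described in the first finding can be sketched as follows. This is a minimal illustration, assuming non-overlapping square patches on a single-channel image; the function name `patchify` and the toy 4x4 input are hypothetical and not taken from the paper, and the learned linear projection that ViT applies to each patch is omitted.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch tokens.

    Each token is the row-major flattening of one patch.
    Assumption: square, non-overlapping patches; single channel.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


# Toy 4x4 "image" whose pixel values are 0..15 in row-major order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

print(len(tokens))  # 4, i.e. (H / patch) * (W / patch)
print(tokens[0])    # [0, 1, 4, 5]
print(tokens[1])    # [2, 3, 6, 7]
```

Pixels 1 and 2 are neighbors in the image but land in different tokens, so any edge running across that patch boundary is split between tokens at this stage. That is the kind of local structure the second finding says simple tokenization fails to model.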

💡 Why This Matters

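The paper addresses a practical limitation of ViT: when trained from scratch on a midsize dataset such as ImageNet, it underperforms CNNs. T2T-ViT shows that a transformer of ResNet50-comparable size (21.5M parameters) can reach 83.3% top-1 accuracy on ImageNet.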

📋 Article Details

Category 🤖 Artificial Intelligence
Published Oct 01, 2021
Journal 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Authors Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi
DOI 10.1109/iccv48922.2021.00060
Citations 2,247
Source OpenAlex
