Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. T2T-ViT, at a size comparable to ResNet50 (21.5M parameters), achieves 83.3% top-1 accuracy on ImageNet at an image resolution of 384x384.
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Oct 01, 2021 |
| Journal | 2021 IEEE/CVF International Conference on Computer Vision (ICCV) |
| Authors | Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi |
| DOI | 10.1109/iccv48922.2021.00060 |
| Citations | 2,247 |
| Source | OpenAlex |