We introduce BEiT, a self-supervised vision representation model whose name stands for Bidirectional Encoder representation from Image Transformers. The large-size BEiT obtains 86.3% top-1 accuracy using only ImageNet-1K, outperforming even ViT-L with supervised pre-training on ImageNet-22K (85.2%).
| Category | 🤖 Artificial Intelligence |
| Published | Jun 15, 2021 |
| Journal | arXiv (Cornell University) |
| Authors | Hangbo Bao, Li Dong, Songhao Piao, Furu Wei |
| DOI | 10.48550/arxiv.2106.08254 |
| Citations | 926 |
| Source | OpenAlex |