Large-scale vision-and-language representation learning has shown promising improvements on various vision-language tasks. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state of the art, while also offering faster inference.
This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Jul 16, 2021 |
| Journal | arXiv (Cornell University) |
| Authors | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong |
| DOI | 10.48550/arxiv.2107.07651 |
| Citations | 822 |
| Source | OpenAlex |