Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
This research advances how AI systems learn, reason, and solve problems, with direct implications for automation and scientific discovery.
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Oct 01, 2021 |
| Journal | 2021 IEEE/CVF International Conference on Computer Vision (ICCV) |
| Authors | Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra |
| DOI | 10.1109/iccv48922.2021.00180 |
| Citations | 673 |
| Source | OpenAlex |