We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models.
This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.
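To make the idea of zero-shot recognition in a joint embedding space concrete, here is a minimal sketch. It assumes a common recipe that the summary above does not spell out: a query from one modality (here, audio) is labeled by finding the class prompt whose text embedding has the highest cosine similarity to it. The function names and the 3-dimensional toy vectors are hypothetical placeholders, not outputs of the actual ImageBind encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(query, class_embeddings):
    """Return the cosine similarity between the query and each class embedding."""
    q = l2_normalize(query)
    return {
        label: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for label, emb in class_embeddings.items()
    }


def zero_shot_classify(query, class_embeddings):
    """Pick the class whose embedding is closest to the query in the shared space."""
    scores = cosine_scores(query, class_embeddings)
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings: in a real system these would come from
# modality-specific encoders trained to share one embedding space.
text_embeddings = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "engine": [0.1, 0.0, 0.95],
}
audio_embedding = [0.8, 0.2, 0.1]  # made-up embedding of an audio clip

label, scores = zero_shot_classify(audio_embedding, text_embeddings)
print(label)
```

Because the comparison only needs vectors in the same space, the same function would work for any pair of modalities, which is what makes a single joint embedding attractive.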
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Jun 01, 2023 |
| Journal | Research Journal |
| Authors | Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala |
| DOI | 10.1109/cvpr52729.2023.01457 |
| Citations | 701 |
| Source | OpenAlex |