
Frozen in time: A joint video and image encoder for end-to-end retrieval

📅 January 1, 2022 👤 A. Zisserman, Arsha Nagrani, Gül Varol et al. 📖 Oxford University Research Archive (ORA) (University of Oxford) 📊 752 citations

🤖 Plain-English Summary

Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval. We also provide a new video-text pretraining dataset, WebVid-2M, comprising over two million videos with weak captions scraped from the internet.
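To make the idea of a joint embedding concrete, here is a minimal sketch of how retrieval could work once text and videos live in the same vector space. It is not the paper's implementation: the function names `l2_normalize` and `rank_videos`, the clip identifiers, and the hand-written three-dimensional vectors are all hypothetical, and a real system would obtain the embeddings from trained text and video encoders. The efficiency comes from the fact that video embeddings can be computed once and stored, so answering a query reduces to a similarity search.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_videos(text_embedding, video_embeddings):
    """Return (video_id, score) pairs sorted by cosine similarity, best first."""
    query = l2_normalize(text_embedding)
    scored = []
    for video_id, embedding in video_embeddings.items():
        unit = l2_normalize(embedding)
        score = sum(q * v for q, v in zip(query, unit))
        scored.append((video_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values, standing in for encoder outputs).
video_index = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.2],
    "clip_c": [0.5, 0.5, 0.5],
}
query_embedding = [1.0, 0.0, 0.0]

ranking = rank_videos(query_embedding, video_index)
for video_id, score in ranking:
    print(video_id, round(score, 3))
```

In this toy index the query points along the first axis, so the clip whose embedding is closest to that direction is ranked first.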

🔑 Key Findings

  • The challenges in this area include the design of the visual architecture and the nature of the training data: the available large-scale video-text training datasets, such as HowTo100M, are noisy, so competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper.
  • We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
  • Our model is an adaptation and extension of the recent ViT and TimeSformer architectures, and consists of attention in both space and time.
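The phrase "attention in both space and time" can be illustrated with a small sketch. The code below is a simplified illustration, not the model from the paper: it assumes a divided scheme (a temporal step followed by a spatial step), uses a single attention head with identity query, key, and value projections, and omits residual connections, layer normalization, and positional embeddings. The names `attend` and `space_time_attention` and the toy features are hypothetical. The sketch also assumes that an image can be passed through the same code as a one-frame video, which is one way a single encoder could consume both image and video captioning datasets.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Single-head self-attention over a list of vectors (identity projections)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def space_time_attention(video):
    """Apply temporal then spatial attention; video[t][s] is patch s of frame t."""
    n_frames, n_patches = len(video), len(video[0])
    # Temporal step: each patch position attends to the same position in other frames.
    temporal = [[None] * n_patches for _ in range(n_frames)]
    for s in range(n_patches):
        track = attend([video[t][s] for t in range(n_frames)])
        for t in range(n_frames):
            temporal[t][s] = track[t]
    # Spatial step: patches within the same frame attend to each other.
    return [attend(frame) for frame in temporal]


# Toy input: 3 frames, 2 patches per frame, 2-dimensional features (hypothetical values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
image = [video[0]]  # an image handled as a one-frame video

video_out = space_time_attention(video)
image_out = space_time_attention(image)
```

With a single frame, the temporal step has only one token to attend to and leaves it unchanged, so the image path reduces to purely spatial attention.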

💡 Why This Matters

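A joint text-video embedding makes text-to-video retrieval efficient, and an end-to-end model that can train on both image and video captioning datasets is aimed at the two challenges named above: the design of the visual architecture and the reliance on noisy video-text data that is competitive only with large amounts of compute.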


📋 Article Details

Category: 🤖 Artificial Intelligence
Published: Jan 01, 2022
Journal: Oxford University Research Archive (ORA) (University of Oxford)
Authors: A. Zisserman, Arsha Nagrani, Gül Varol, M. Bain
Citations: 752
Source: OpenAlex
