Home / Research Articles Hub / Video Swin Transformer
🤖 Artificial Intelligence OpenAlex

Video Swin Transformer

📅 Published: June 1, 2022 👤 Ze Liu, Ning Jia, Yue Cao et al. 📖 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 📊 1,892 citations
AI-Generated Summary

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models.

⚡ This is an original paraphrased summary — not copied from the abstract. Full paper available at the source link below.

Key Findings
  • 1 These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions.
  • 2 In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization.
  • 3 The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models.
Why It Matters

This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.

This summary is based on publicly available metadata and abstract. For the full research paper, visit the original source:

Read Full Paper at OpenAlex
More Artificial Intelligence Papers ← Back to Hub 📚 Learning Hub
Article Details
Source OpenAlex
Category 🤖 Artificial Intelligence
Published Jun 1, 2022
Journal 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI 10.1109/cvpr52688.2022.00320
Citations 1,892
Authors Ze Liu, Ning Jia, Yue Cao, Yixuan Wei, Zheng Zhang