Video Swin Transformer

🤖 Plain-English Summary

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models.

🔑 Key Findings

These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions.
In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models.

💡 Why This Matters

This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.

Read the full paper
Access the original peer-reviewed research via OpenAlex.

View on DOI ↗

📜 Copyright Notice: This page shows only metadata (title, authors, journal, date) and an original AI-generated summary. No abstract or full article text is copied. The original research is the intellectual property of its authors and publisher. ScienceTrace does not reproduce copyrighted content.

← More Artificial Intelligence All Research Articles

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Jun 01, 2022
Journal	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors	Ze Liu, Ning Jia, Yue Cao, Yixuan Wei, Zheng Zhang
DOI	10.1109/cvpr52688.2022.00320
Citations	1,892
Source	OpenAlex

🗂️ Research Categories

🤖 Artificial Intelligence 🧬 Medicine & Biology ⚛️ Physics & Space Science ⚙️ Engineering & Technology ∑ Mathematics