Video Swin Transformer

📅 June 1, 2022 👤 Ze Liu, Jia Ning, Yue Cao et al. 📖 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 📊 1,892 citations

🤖 Plain-English Summary

Computer vision is shifting from CNNs to Transformers, and pure Transformer architectures now reach top accuracy on the major video recognition benchmarks. Video Swin Transformer keeps self-attention local by adapting the Swin Transformer, originally designed for images, to video, while still benefiting from pre-trained image models.

🔑 Key Findings

  • Earlier video Transformers are all built on layers that connect patches globally across the spatial and temporal dimensions.
  • The paper instead advocates an inductive bias of locality in video Transformers, which gives a better speed-accuracy trade-off than previous approaches that compute self-attention globally, even those that use spatial-temporal factorization.
  • This locality is realized by adapting the Swin Transformer, designed for the image domain, to video, while continuing to leverage the power of pre-trained image models.
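
To make the contrast between global and local attention concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the token grid size, the window size, and the helper names (`window_id`, `partition_tokens`, `attention_pairs`) are illustrative assumptions, and the window shifting used by Swin-style models is left out. It only groups video tokens into non-overlapping 3D windows and counts how many query-key pairs each scheme would score.

```python
from itertools import product


def window_id(t, h, w, window):
    """Map a token's (t, h, w) grid position to the 3D window that contains it."""
    wt, wh, ww = window
    return (t // wt, h // wh, w // ww)


def partition_tokens(grid, window):
    """Group every token position of a T x H x W grid into non-overlapping 3D windows."""
    T, H, W = grid
    wt, wh, ww = window
    if T % wt or H % wh or W % ww:
        raise ValueError("this sketch assumes the grid is divisible by the window")
    windows = {}
    for pos in product(range(T), range(H), range(W)):
        windows.setdefault(window_id(*pos, window), []).append(pos)
    return windows


def attention_pairs(grid, window=None):
    """Count query-key pairs: over all tokens (global) or only inside each window (local)."""
    T, H, W = grid
    n = T * H * W
    if window is None:
        return n * n
    return sum(len(tokens) ** 2 for tokens in partition_tokens(grid, window).values())


grid = (8, 14, 14)   # assumed token grid: 8 temporal steps x 14 x 14 patches
window = (2, 7, 7)   # assumed 3D window size, chosen only for illustration

global_pairs = attention_pairs(grid)
local_pairs = attention_pairs(grid, window)
print(global_pairs, local_pairs, global_pairs // local_pairs)
```

With these assumed sizes, the grid splits into 16 windows of 98 tokens each, so windowed attention scores 16 times fewer query-key pairs than global attention. The gap widens as the grid grows, because the global count is quadratic in the number of tokens while the windowed count is linear for a fixed window size.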

💡 Why This Matters

Restricting self-attention to local neighborhoods gives video Transformers a better speed-accuracy trade-off than attending globally across space and time, and it lets video models build on pre-trained image models.

📋 Article Details

Category 🤖 Artificial Intelligence
Published Jun 01, 2022
Journal 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang
DOI 10.1109/cvpr52688.2022.00320
Citations 1,892
Source OpenAlex