Swin Transformer V2: Scaling Up Capacity and Resolution

🤖 Plain-English Summary

We present techniques for scaling Swin Transformer up to 3 billion parameters and enabling it to train on images of up to 1,536×1,536 resolution. Using these techniques and self-supervised pre-training, we successfully train a strong 3-billion-parameter Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving state-of-the-art accuracy on a variety of benchmarks.

🔑 Key Findings

By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification.
We tackle issues of training instability, and study how to effectively transfer models pre-trained at low resolutions to higher resolution ones.
To this end, several new techniques are proposed: 1) a residual post normalization technique and a scaled cosine attention approach to improve the stability of large vision models; 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained on low-resolution images and windows to their higher-resolution counterparts.
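
The three techniques named above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the default temperature `tau=0.1`, and the omission of learnable LayerNorm parameters are assumptions made for illustration. The formulas follow the paper's description as we understand it: the residual branch is normalized before it is added back, attention logits use cosine similarity divided by a temperature that is kept above 0.01, and relative offsets are mapped to log space before being fed to the position-bias network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def res_post_norm(x, sublayer):
    """Residual post-norm: normalize the sublayer output, then add it to the input."""
    return [a + b for a, b in zip(x, layer_norm(sublayer(x)))]


def cosine_logit(q, k, tau=0.1, bias=0.0, tau_min=0.01):
    """Scaled cosine attention logit: cos(q, k) / tau + bias.

    tau is learnable in a real model; here it is a plain argument (assumption).
    """
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k) / max(tau, tau_min) + bias


def log_spaced(delta):
    """Map a signed relative offset to log space: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


# Cosine logits ignore the magnitude of q and k, so large activations
# cannot inflate the attention scores.
print(cosine_logit([1.0, 2.0], [2.0, 4.0]))
print(cosine_logit([1000.0, 2000.0], [2.0, 4.0]))

# Growing the window from 8x8 to 16x16 stretches the largest offset from 7 to 15.
print(15 / 7)                          # range growth in linear coordinates
print(log_spaced(15) / log_spaced(7))  # range growth in log-spaced coordinates
```

In this sketch the largest offset grows by a factor of about 2.14 in linear coordinates but only about 1.33 in log-spaced coordinates, which suggests why a position bias learned at a small window size needs less extrapolation when the window is enlarged.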

💡 Why This Matters

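Stable training at the 3-billion-parameter scale, combined with transfer from low-resolution pre-training to high-resolution inputs, makes larger vision models practical for tasks such as object detection, semantic segmentation, and video action classification.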

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Jun 01, 2022
Journal	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors	Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie
DOI	10.1109/cvpr52688.2022.01170
Citations	2,188
Source	OpenAlex
