SimMIM: a Simple Framework for Masked Image Modeling

🤖 Plain-English Summary

This paper presents SimMIM, a simple framework for masked image modeling. We also leverage this approach to address the data-hungry issue faced by large-scale model training, that a 3B model (Swin V2-G) is successfully trained to achieve advanced accuracy on four representative vision benchmarks using 40× less labelled data than that in previous practice (JFT-3B).

🔑 Key Findings

We have simplified recently proposed relevant approaches, without the need for special designs, such as block-wise masking and tokenization via discrete VAE or clustering.
To investigate what makes a masked image modeling task learn good representations, we systematically study the major components in our framework, and find that the simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pre-text task; 2) predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones.
Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing previous best approach by +0.6%.

💡 Why This Matters

This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.

Read the full paper
Access the original peer-reviewed research via OpenAlex.

View on DOI ↗

📜 Copyright Notice: This page shows only metadata (title, authors, journal, date) and an original AI-generated summary. No abstract or full article text is copied. The original research is the intellectual property of its authors and publisher. ScienceTrace does not reproduce copyrighted content.

← More Artificial Intelligence All Research Articles

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Jun 01, 2022
Journal	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors	Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao
DOI	10.1109/cvpr52688.2022.00943
Citations	1,146
Source	OpenAlex

🗂️ Research Categories

🤖 Artificial Intelligence 🧬 Medicine & Biology ⚛️ Physics & Space Science ⚙️ Engineering & Technology ∑ Mathematics