MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

🤖 Plain-English Summary

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. On the ImageNet-1k dataset, MobileViT achieves a top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based), respectively, for a similar number of parameters.
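
The comparison above can be unpacked with a little arithmetic. The sketch below assumes the quoted 3.2% and 6.2% gains are absolute percentage points of top-1 accuracy (an assumption, not something this page states), and the variable names are illustrative only.

```python
# Figures quoted in the summary.
mobilevit_top1 = 78.4        # MobileViT top-1 accuracy on ImageNet-1k, in percent
gain_over_mobilenetv3 = 3.2  # quoted accuracy gain over MobileNetv3
gain_over_deit = 6.2         # quoted accuracy gain over DeiT

# Assumption: the gains are absolute percentage points, so the baseline
# accuracies follow by subtraction. round() removes floating-point noise.
implied_mobilenetv3_top1 = round(mobilevit_top1 - gain_over_mobilenetv3, 1)
implied_deit_top1 = round(mobilevit_top1 - gain_over_deit, 1)

print(f"Implied MobileNetv3 top-1: {implied_mobilenetv3_top1}%")
print(f"Implied DeiT top-1: {implied_deit_top1}%")
```

Under that assumption, the baselines sit at roughly 75.2% and 72.2% top-1 accuracy for a similar parameter budget.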

🔑 Key Findings

Light-weight CNNs have spatial inductive biases that allow them to learn representations with fewer parameters across different vision tasks.
However, these networks are spatially local.
To learn global representations, self-attention-based vision transformers (ViTs) have been adopted.
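
The contrast between spatially local convolution and global self-attention can be illustrated with a toy experiment. This is a minimal pure-Python sketch on a 1D signal, not MobileViT's architecture: the helper names (`conv1d_same`, `self_attention_1d`, `changed_positions`) are hypothetical, and the attention uses identity projections as a simplification. It perturbs one input element and reports which outputs change.

```python
import math

def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding; output has the same length as x."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]

def self_attention_1d(x):
    """Single-head self-attention over scalar tokens.

    Simplification: query, key, and value are the token itself
    (identity projections), so no learned weights are involved.
    """
    out = []
    for q in x:
        scores = [q * key for key in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out

def changed_positions(f, x, index, delta=1.0, tol=1e-12):
    """Return the output indices of f that change when x[index] is shifted by delta."""
    base = f(x)
    perturbed_input = list(x)
    perturbed_input[index] += delta
    perturbed = f(perturbed_input)
    return [i for i, (a, b) in enumerate(zip(base, perturbed)) if abs(a - b) > tol]

x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
kernel = [0.25, 0.5, 0.25]  # width-3 kernel: each output sees only its neighbours

conv_changed = changed_positions(lambda v: conv1d_same(v, kernel), x, index=0)
attn_changed = changed_positions(self_attention_1d, x, index=0)

print("convolution outputs affected:", conv_changed)     # [0, 1]
print("self-attention outputs affected:", attn_changed)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With a width-3 kernel, changing the first element only affects the two outputs whose window covers it, while every self-attention output changes because each token attends to all others.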

💡 Why This Matters

MobileViT shows that a transformer-based vision model can be both accurate and small enough for mobile use: about 6 million parameters for 78.4% top-1 accuracy on ImageNet-1k, ahead of CNN-based and ViT-based baselines of similar size.

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Oct 05, 2021
Journal	arXiv (Cornell University)
Authors	Sachin Mehta, Mohammad Rastegari
DOI	10.48550/arxiv.2110.02178
Citations	734
Source	OpenAlex
