In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. Without bells-and-whistles, MViTv2 has advanced performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 AP<sup>box</sup> on COCO object detection as well as 86.1% on Kinetics-400 video classification.
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Jun 01, 2022 |
| Journal | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
| Authors | Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong |
| DOI | 10.1109/cvpr52688.2022.00476 |
| Citations | 719 |
| Source | OpenAlex |