EfficientViT: Memory Efficient Vision Transformer with Casca...

AI-Generated Summary

Vision transformers have shown great success due to their high model capabilities. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8×/3.7× faster on the GPU/CPU, and 7.4× faster when converted to ONNX format.
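To put the quoted speedups in perspective, here is a small illustrative calculation. It assumes that "k× faster" means a k-fold throughput ratio (the summary does not say how speed was measured), so the per-image time is 1/k of the baseline. The function name `relative_latency` is made up for this sketch.

```python
# Illustrative only: treats "k times faster" as a throughput ratio (an assumption).
SPEEDUPS = {"GPU": 5.8, "CPU": 3.7, "ONNX": 7.4}  # figures quoted in the summary


def relative_latency(speedup: float) -> float:
    """Fraction of the baseline's per-image time, assuming speedup = throughput ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


for target, speedup in SPEEDUPS.items():
    # Print the share of baseline time that remains under the stated assumption.
    print(f"{target}: {relative_latency(speedup):.1%} of baseline time per image")
```

Under that reading, the GPU figure corresponds to roughly one sixth of the MobileViT-XXS time per image.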


Key Findings

1 The remarkable performance of vision transformers is accompanied by heavy computation costs, which makes them unsuitable for real-time applications.
2 The paper proposes a family of high-speed vision transformers named EfficientViT.
3 The authors find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in MHSA (multi-head self-attention).
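The third finding can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte of memory traffic). The sketch below is not the paper's analysis. It assumes 32-bit floats, that each tensor is read or written exactly once, and hypothetical sizes of 196 tokens and 64 channels per head. All function names are made up for illustration.

```python
BYTES_PER_ELEM = 4  # assumption: 32-bit floats


def matmul_intensity(m: int, k: int, n: int) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product, each matrix touched once."""
    flops = 2 * m * k * n  # one multiply and one add per accumulated term
    traffic = BYTES_PER_ELEM * (m * k + k * n + m * n)  # read A and B, write C
    return flops / traffic


def elementwise_intensity(flops_per_elem: float = 1.0) -> float:
    """FLOPs per byte for an op that reads and writes each element once."""
    return flops_per_elem / (2 * BYTES_PER_ELEM)


def reshape_copy_intensity() -> float:
    """A reshape that forces a copy moves data but performs no arithmetic."""
    return 0.0


# Hypothetical sizes for one attention head (not taken from the paper).
tokens, head_dim = 196, 64
scores = matmul_intensity(tokens, head_dim, tokens)  # Q @ K^T

print(f"Q @ K^T      : {scores:.2f} FLOPs/byte")
print(f"element-wise : {elementwise_intensity():.3f} FLOPs/byte")
print(f"reshape copy : {reshape_copy_intensity():.3f} FLOPs/byte")
```

Operations with low FLOPs per byte tend to be limited by memory bandwidth rather than compute, which is one way to read why reshaping and element-wise functions can dominate runtime even when they contribute few FLOPs.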

Why It Matters

This work identifies memory-inefficient operations as a common speed bottleneck in existing vision transformers and proposes a model family aimed at real-time applications, with reported speedups on GPU, CPU, and ONNX.

This summary is based on publicly available metadata and abstract. For the full research paper, visit the original source:

Read Full Paper at OpenAlex


Article Details

Source	OpenAlex
Category	Computer Vision
Published	Jun 1, 2023
Journal	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023
DOI	10.1109/cvpr52729.2023.01386
Citations	729
Authors	Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention