🤖 Artificial Intelligence OpenAlex

Emerging Properties in Self-Supervised Vision Transformers

📅 October 1, 2021 👤 Mathilde Caron, Hugo Touvron, Ishan Misra et al. 📖 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 📊 4,980 citations

🤖 Plain-English Summary

In this paper, we question whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets). We implement our findings in a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
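
The self-distillation idea can be sketched in plain Python. This is a minimal sketch, not the authors' implementation. It assumes the centering, sharpening, and exponential-moving-average ("momentum") teacher update described in the full paper. It uses plain lists instead of tensors, and the function names (`dino_loss`, `update_center`, `ema_update`) are hypothetical. The temperatures (0.1 for the student, 0.04 for the teacher) and momenta (0.9 for the center, 0.996 for the teacher) are assumed defaults.

```python
import math
from typing import List, Sequence

# Hypothetical helper names; the default hyperparameters below follow my reading
# of the full DINO paper and should be treated as assumptions.


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(student_logits: Sequence[float], teacher_logits: Sequence[float],
              center: Sequence[float], student_temp: float = 0.1,
              teacher_temp: float = 0.04) -> float:
    """Cross-entropy H(P_t, P_s) for one (teacher view, student view) pair.

    The teacher output is centered (running mean subtracted) and sharpened
    (low temperature); the student uses a higher temperature. In a real
    framework the teacher term would carry no gradient.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def update_center(center: Sequence[float], teacher_batch: Sequence[Sequence[float]],
                  momentum: float = 0.9) -> List[float]:
    """EMA of the batch-mean teacher output, used for centering."""
    n = len(teacher_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_batch)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]


def ema_update(teacher_params: Sequence[float], student_params: Sequence[float],
               momentum: float = 0.996) -> List[float]:
    """Momentum encoder: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage: 3-dimensional output logits and flat parameter lists.
teacher_out = [2.0, 0.5, -1.0]
student_out = [1.5, 0.7, -0.8]
center = [0.0, 0.0, 0.0]
loss = dino_loss(student_out, teacher_out, center)
teacher_batch = [[2.0, 0.5, -1.0], [1.0, 1.0, 0.0]]
center = update_center(center, teacher_batch)
teacher_params = ema_update([1.0, -1.0], [0.0, 0.0])
```

As I understand the full method, the loss is summed over pairs of different crops. The teacher sees only the global crops, while the student sees all of them. Only the student is trained by gradient descent; the teacher follows it through an update like `ema_update`.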

🔑 Key Findings

  • Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.
  • Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT.
  • Our study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs.
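
The k-NN evaluation can be sketched as a weighted vote over cosine similarities of frozen features. This is a minimal sketch under stated assumptions. The weighting `exp(similarity / temperature)`, `k=20`, and `temperature=0.07` are the settings I believe the full paper uses. The function names are hypothetical, and the toy 2-dimensional features stand in for real ViT embeddings.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def knn_predict(query: Sequence[float], train_feats: Sequence[Sequence[float]],
                train_labels: Sequence[Hashable], k: int = 20,
                temperature: float = 0.07) -> Hashable:
    """Predict a label by a similarity-weighted vote of the k nearest neighbours."""
    q = l2_normalize(query)
    scored = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), label))
    # Keep the k most similar training examples (cosine similarity, descending).
    neighbours = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in neighbours:
        votes[label] += math.exp(sim / temperature)  # closer neighbours weigh more
    return max(votes, key=votes.get)


# Toy features: two well-separated clusters standing in for frozen embeddings.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["cat", "cat", "dog", "dog"]
print(knn_predict([0.95, 0.15], train_feats, train_labels, k=3))  # cat
```

One reason this is a useful probe is that the classifier has no trainable parameters. Its accuracy therefore depends only on how well the frozen features separate the classes.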

💡 Why This Matters

This work shows that Vision Transformers trained without any labels can learn features that encode object segmentation and work well with a simple k-NN classifier. These properties emerge less clearly in supervised ViTs and in convnets.

📋 Article Details

Category: 🤖 Artificial Intelligence
Published: Oct 01, 2021
Venue: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal
DOI: 10.1109/iccv48922.2021.00951
Citations: 4,980
Source: OpenAlex
