Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive, accurate labels used in standard unimodal supervised vision learning. Additionally, we provide several nearest neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content.
This research advances how AI systems learn, reason, and solve problems, with direct implications for automation and scientific discovery.
| Field | Value |
| --- | --- |
| Category | 🤖 Artificial Intelligence |
| Published | Oct 16, 2022 |
| Journal | arXiv (Cornell University) |
| Authors | Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman |
| DOI | 10.48550/arxiv.2210.08402 |
| Citations | 1,037 |
| Source | OpenAlex |