Sigmoid Loss for Language Image Pre-Training

🤖 Plain-English Summary

We propose a simple pairwise sigmoid loss for imagetext pre-training. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient.

🔑 Key Findings

Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization.
The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.
With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size, the latter achieves 84.5% ImageNet zero-shot accuracy in two days.

💡 Why This Matters

This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.

Read the full paper
Access the original peer-reviewed research via OpenAlex.

View on DOI ↗

📜 Copyright Notice: This page shows only metadata (title, authors, journal, date) and an original AI-generated summary. No abstract or full article text is copied. The original research is the intellectual property of its authors and publisher. ScienceTrace does not reproduce copyrighted content.

← More Artificial Intelligence All Research Articles

📋 Article Details

Category	🤖 Artificial Intelligence
Published	Oct 01, 2023
Journal	Research Journal
Authors	Xiaohua Zhai, Basil Mustafa, А. И. Колесников, Lucas Beyer
DOI	10.1109/iccv51070.2023.01100
Citations	614
Source	OpenAlex

🗂️ Research Categories

🤖 Artificial Intelligence 🧬 Medicine & Biology ⚛️ Physics & Space Science ⚙️ Engineering & Technology ∑ Mathematics