Training Compute-Optimal Large Language Models

AI-Generated Summary

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.

⚡ This is an original paraphrased summary — not copied from the abstract. Full paper available at the source link below.

Key Findings

1 We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
2 By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
3 We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more more data.

Why It Matters

This research advances how AI systems learn, reason, and solve problems — with direct implications for automation and scientific discovery.

This summary is based on publicly available metadata and abstract. For the full research paper, visit the original source:

Read Full Paper at OpenAlex

More Artificial Intelligence Papers ← Back to Hub 📚 Learning Hub

Article Details

Source	OpenAlex
Category	🤖 Artificial Intelligence
Published	Mar 29, 2022
Journal	arXiv (Cornell University)
DOI	10.48550/arxiv.2203.15556
Citations	663
Authors	Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai