🤖 Artificial Intelligence OpenAlex

Training Compute-Optimal Large Language Models

📅 March 29, 2022 👤 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch et al. 📖 arXiv (Cornell University) 📊 663 citations

🤖 Plain-English Summary

We investigate the optimal model size and number of training tokens for a transformer language model under a given compute budget. The resulting compute-optimal model, Chinchilla, is smaller than Gopher for the same training compute, which also means that it uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.

🔑 Key Findings

  • We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
  • We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data.
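
The equal-scaling rule above can be illustrated with a short sketch. It assumes the common approximation that training compute is about 6 × parameters × tokens, which this summary does not state, and the baseline model size and token count are hypothetical values chosen only for illustration. Under that assumption, multiplying the compute budget by k means multiplying both parameters and tokens by √k.

```python
import math


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost, ASSUMING the common C ~ 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens


def scale_equally(n_params: float, n_tokens: float, compute_multiplier: float):
    """Grow parameters and tokens by the same factor so that the
    approximate compute grows by `compute_multiplier`."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    factor = math.sqrt(compute_multiplier)  # equal share for model and data
    return n_params * factor, n_tokens * factor


# Hypothetical baseline (not taken from the paper): 1B parameters, 20B tokens.
base_params, base_tokens = 1e9, 20e9

# Quadrupling compute doubles both the model size and the token count.
new_params, new_tokens = scale_equally(base_params, base_tokens, 4.0)
print(f"{new_params:.2e} parameters, {new_tokens:.2e} tokens")

# Trading 4x fewer parameters for 4x more tokens leaves the estimate unchanged.
same_budget = math.isclose(
    approx_train_flops(base_params / 4, base_tokens * 4),
    approx_train_flops(base_params, base_tokens),
)
print("same approximate compute:", same_budget)
```

The last check mirrors the trade described in the findings: a model with fewer parameters trained on proportionally more data fits in the same approximate training budget.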

💡 Why This Matters

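For a fixed training compute budget, a smaller model trained on more data can match the compute-optimal recipe better than a larger, undertrained one. The smaller model is also cheaper to fine-tune and to run at inference time, which makes downstream use more practical.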

📋 Article Details

Category 🤖 Artificial Intelligence
Published Mar 29, 2022
Journal arXiv (Cornell University)
Authors Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai
DOI 10.48550/arxiv.2203.15556
Citations 663
Source OpenAlex
