Training Compute-Optimal Large Language Models

https://arxiv.org/abs/2203.15556

  • Kaplan et al. showed that there is a relationship between the number of parameters in an LLM and its performance.
    • Notably, they also suggested that, to be compute optimal, large models should not be trained to their lowest possible loss.
    • For example, given a 10x computational budget, the model size should increase 5.5x while the number of training tokens should increase only 1.8x. (Note that 5.5 × 1.8 ≈ 10: training compute scales roughly with the product of model size and token count.)
  • However, the authors found that the model size and the number of training tokens should be scaled in equal proportions.
    • Roughly speaking, given a 100x computational budget, the model size should increase 10x and the number of training tokens should also increase 10x.
Note that Kaplan’s projected model-size scaling was a bit steeper than what this paper suggests (see the sketch below).
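
To make the comparison concrete, here is a minimal numerical sketch. The exponents used below (roughly N ∝ C^0.73, D ∝ C^0.27 for the Kaplan-style prescription, and N ∝ C^0.5, D ∝ C^0.5 for the equal-proportion scaling argued for in this paper) are approximate summaries of the two prescriptions, not exact fitted values.

```python
# Rough sketch of the two scaling prescriptions (exponents are approximations,
# not exact values from either paper).

def scaling_multipliers(compute_multiplier, n_exp, d_exp):
    """Return (model-size multiplier, token multiplier) for a compute multiplier."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

for budget in (10, 100):
    kaplan_n, kaplan_d = scaling_multipliers(budget, 0.73, 0.27)   # Kaplan-style
    chin_n, chin_d = scaling_multipliers(budget, 0.50, 0.50)       # equal scaling
    print(f"{budget}x compute: Kaplan -> ~{kaplan_n:.1f}x params, ~{kaplan_d:.1f}x tokens; "
          f"Chinchilla -> ~{chin_n:.1f}x params, ~{chin_d:.1f}x tokens")
```

For a 10x budget this reproduces the ~5.5x / ~1.8x split quoted above, and for a 100x budget it gives the 10x / 10x split of the equal-proportion rule.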
  • Based on this observation, the paper presents Chinchilla, which outperforms much larger previous models despite having far fewer parameters.
    • The main reason is that Chinchilla is trained on many more tokens (~1.4 trillion).
Sizes and token counts of recent LLMs. Note that Chinchilla has a much smaller size-to-tokens ratio.
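
As a rough illustration of what the caption above means, the sketch below uses the common approximation that training compute is C ≈ 6·N·D FLOPs (N = parameters, D = training tokens) together with the figures reported for Gopher (280B parameters, ~300B tokens) and Chinchilla (70B parameters, ~1.4T tokens).

```python
# Minimal sketch, assuming the common approximation C ≈ 6 * N * D for training FLOPs.
# The (params, tokens) pairs are the figures reported for Gopher and Chinchilla.

def training_flops(params, tokens):
    return 6 * params * tokens

models = {
    "Gopher":     (280e9, 300e9),    # 280B params, ~300B tokens
    "Chinchilla": (70e9,  1.4e12),   # 70B params, ~1.4T tokens
}

for name, (n, d) in models.items():
    print(f"{name:<11} C ≈ {training_flops(n, d):.2e} FLOPs, "
          f"tokens/param ≈ {d / n:.0f}")
```

Both models consume a similar compute budget (~5-6e23 FLOPs), but Chinchilla spends it on roughly 20 tokens per parameter rather than about 1, which is exactly the smaller size-to-tokens ratio noted in the caption.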
