https://arxiv.org/abs/2203.15556 — "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022)
- Kaplan et al. (2020) showed that there is a power-law relationship between the number of parameters in an LLM and its performance.
- Notably, they also suggested that large models should not be trained to their lowest possible loss to be compute optimal.
- For example, given a 10x increase in computational budget, they suggest the model size should increase 5.5x while the number of training tokens should only increase 1.8x. (Note that 5.5 * 1.8 ≈ 10, since training compute scales roughly as the product of model size and token count.)
- However, the authors of this paper find that model size and the number of training tokens should be scaled in equal proportions.
- Roughly speaking, given a 100x computational budget, the model size should increase 10x and the number of training tokens should also increase 10x (see the sketch below).
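
Below is a minimal sketch (my own illustration, not code from the paper) contrasting the two prescriptions for splitting a larger compute budget between parameters and tokens. It assumes the common approximation that training compute C ≈ 6·N·D (N parameters, D tokens) and the approximate exponents reported in the paper: N ∝ C^0.73, D ∝ C^0.27 for Kaplan et al., versus N ∝ C^0.5, D ∝ C^0.5 for the compute-optimal fit here. The baseline model size and token count are made up for illustration.

```python
# Sketch: how a 10x compute budget increase is allocated under the two scaling rules.
# Assumes C ~ 6 * N * D, so if N ~ C^a and D ~ C^b with a + b ~ 1, compute stays consistent.

def scale_allocation(n_base, d_base, budget_multiplier, n_exp, d_exp):
    """Scale a baseline (params, tokens) pair for a larger compute budget."""
    return (n_base * budget_multiplier ** n_exp,
            d_base * budget_multiplier ** d_exp)

# Hypothetical baseline: 1B parameters trained on 20B tokens.
n0, d0 = 1e9, 20e9

# Kaplan et al. (2020): parameters grow much faster than tokens (~C^0.73 vs ~C^0.27).
n_k, d_k = scale_allocation(n0, d0, 10, 0.73, 0.27)

# Chinchilla: parameters and tokens grow in equal proportion (~C^0.5 each).
n_c, d_c = scale_allocation(n0, d0, 10, 0.5, 0.5)

print(f"Kaplan:     {n_k / 1e9:.1f}B params, {d_k / 1e9:.0f}B tokens")  # ~5.4x params, ~1.9x tokens
print(f"Chinchilla: {n_c / 1e9:.1f}B params, {d_c / 1e9:.0f}B tokens")  # ~3.2x params, ~3.2x tokens
```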

- Based on this observation, the paper presents Chinchilla (70B parameters), which outperforms much larger previous models such as Gopher (280B) and GPT-3 (175B) despite having far fewer parameters.
- The main reason is that Chinchilla is trained on many more tokens (about 1.4T, versus 300B for Gopher) for the same compute budget; a rough back-of-the-envelope check is sketched below.
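
Below is a minimal sketch (my own illustration, not from the paper) of the rule of thumb commonly derived from the paper's fits: roughly 20 training tokens per parameter for compute-optimal training, with training compute approximated as C ≈ 6·N·D. Both the 20:1 ratio and the 6·N·D cost model are approximations, not exact prescriptions from the paper.

```python
# Sketch: approximate compute-optimal token count and training FLOPs for a given model size.
# Assumes ~20 tokens per parameter (a common reading of the paper's estimates)
# and the rough cost model C ~ 6 * N * D.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def compute_optimal_tokens(num_params: float) -> float:
    """Approximate number of training tokens for compute-optimal training."""
    return TOKENS_PER_PARAM * num_params

def training_flops(num_params: float, num_tokens: float) -> float:
    """Rough training cost under the C ~ 6 * N * D approximation."""
    return 6 * num_params * num_tokens

# A Chinchilla-sized model: 70B parameters.
n = 70e9
d = compute_optimal_tokens(n)  # 1.4e12 tokens, matching Chinchilla's ~1.4T
print(f"{d / 1e12:.1f}T tokens, ~{training_flops(n, d):.1e} training FLOPs")  # ~5.9e+23

# For comparison, Gopher's split (280B params, 300B tokens) lands at a comparable
# budget under this crude approximation, but spends it very differently.
print(f"Gopher-style split: ~{training_flops(280e9, 300e9):.1e} FLOPs")  # ~5.0e+23
```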
