Training Compute-Optimal Large Language Models

https://arxiv.org/abs/2203.15556

  • Kaplan et al. showed that there is a relationship between the number of parameters in an LLM and its performance.
    • Notably, they also suggested that, to be compute optimal, large models should not be trained to their lowest possible loss.
    • For example, given a 10x computational budget, the model size should increase 5.5x while the number of training tokens should increase only 1.8x. (Note that 5.5 × 1.8 ≈ 10: training compute scales roughly with the product of model size and token count.)
  • However, the authors found that the model size and the number of training tokens should be scaled in equal proportions.
    • Roughly speaking, given a 100x computational budget, the model size should increase 10x and the number of training tokens should also increase 10x.
Note that Kaplan’s projected model-size scaling was a bit steeper than what this paper suggests (see the sketch below).
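
To make the comparison concrete, here is a minimal numerical sketch. The exponents used below (roughly N ∝ C^0.73, D ∝ C^0.27 for the Kaplan-style prescription, and N ∝ C^0.5, D ∝ C^0.5 for the equal-proportion scaling argued for in this paper) are approximate summaries of the two prescriptions, not exact fitted values.

```python
# Rough sketch of the two scaling prescriptions (exponents are approximations,
# not exact values from either paper).

def scaling_multipliers(compute_multiplier, n_exp, d_exp):
    """Return (model-size multiplier, token multiplier) for a compute multiplier."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

for budget in (10, 100):
    kaplan_n, kaplan_d = scaling_multipliers(budget, 0.73, 0.27)   # Kaplan-style
    chin_n, chin_d = scaling_multipliers(budget, 0.50, 0.50)       # equal scaling
    print(f"{budget}x compute: Kaplan -> ~{kaplan_n:.1f}x params, ~{kaplan_d:.1f}x tokens; "
          f"Chinchilla -> ~{chin_n:.1f}x params, ~{chin_d:.1f}x tokens")
```

For a 10x budget this reproduces the ~5.5x / ~1.8x split quoted above, and for a 100x budget it gives the 10x / 10x split of the equal-proportion rule.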
  • Based on this observation, the paper presents Chinchilla, which outperforms much larger previous models despite having far fewer parameters.
    • The main reason is that Chinchilla is trained on many more tokens (~1.4 trillion).
Sizes and token counts of recent LLMs. Note that Chinchilla has a much smaller size-to-tokens ratio.
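
As a rough illustration of what the caption above means, the sketch below uses the common approximation that training compute is C ≈ 6·N·D FLOPs (N = parameters, D = training tokens) together with the figures reported for Gopher (280B parameters, ~300B tokens) and Chinchilla (70B parameters, ~1.4T tokens).

```python
# Minimal sketch, assuming the common approximation C ≈ 6 * N * D for training FLOPs.
# The (params, tokens) pairs are the figures reported for Gopher and Chinchilla.

def training_flops(params, tokens):
    return 6 * params * tokens

models = {
    "Gopher":     (280e9, 300e9),    # 280B params, ~300B tokens
    "Chinchilla": (70e9,  1.4e12),   # 70B params, ~1.4T tokens
}

for name, (n, d) in models.items():
    print(f"{name:<11} C ≈ {training_flops(n, d):.2e} FLOPs, "
          f"tokens/param ≈ {d / n:.0f}")
```

Both models consume a similar compute budget (~5-6e23 FLOPs), but Chinchilla spends it on roughly 20 tokens per parameter rather than about 1, which is exactly the smaller size-to-tokens ratio noted in the caption.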
