LLaMA: Open and Efficient Foundation Language Models

  • https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/
  • The main contribution of LLaMA is two-fold:
  • The focus of LLaMA is to train language models that achieve the best possible performance at various inference budgets, by training on more tokens than is typically used.
    • Studies by Kaplan (https://arxiv.org/abs/2001.08361) and Hoffmann (https://arxiv.org/abs/2203.15556) explored the optimal ratio between model size and dataset size under a fixed compute budget.
    • For example, given a compute budget of 10, model : dataset = sqrt(10) : sqrt(10) might be better than model : dataset = 5 : 2.
    • However, the compute-optimally trained model is not necessarily the best one at inference, since only the model size matters at the inference phase.
    • So it might be better to train a smaller model on a larger dataset if you plan to deploy the model at a large scale. (A toy numeric sketch follows after this list.)
  • LLaMA only uses publicly available data, making the work compatible with open-sourcing. (YAYYY!)
    • Previous work such as Chinchilla, PaLM, or GPT-3 relies on data that is not publicly available or is undocumented (e.g., Books-2TB).
    • While previous works such as OPT, GPT-Neo, BLOOM, and GLM use public data, they are not competitive with PaLM or Chinchilla.
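A toy numeric sketch of this trade-off, assuming the common approximation that training cost is about 6 * N * D FLOPs and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter (both are illustrative assumptions, not exact fits from those papers):

    # Rough illustration: compute-optimal allocation vs. an inference-friendly one.
    # Assumptions (not from the LLaMA paper): training FLOPs C ~= 6 * N * D,
    # and D ~= 20 * N at the compute optimum (Chinchilla-style rule of thumb).
    def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
        """Return (params N, tokens D) that roughly balance a fixed budget."""
        n_params = (compute_flops / (6 * 20)) ** 0.5
        n_tokens = 20 * n_params
        return n_params, n_tokens

    budget = 6 * 7e9 * 1.0e12  # FLOPs spent training a 7B model on 1T tokens
    n_opt, d_opt = compute_optimal_split(budget)
    print(f"compute-optimal: {n_opt / 1e9:.1f}B params on {d_opt / 1e12:.2f}T tokens")
    # Prints roughly 18.7B params on 0.37T tokens: the compute-optimal model is larger,
    # but the deliberately "overtrained" 7B model is much cheaper to serve.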
Data mixtures that LLaMA used for pretraining
  • LLaMA uses the byte-pair encoding (BPE) algorithm with the implementation from SentencePiece. (A toy sketch follows after this list.)
    • Note that numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes.
  • The entire training dataset contains roughly 1.4T tokens, and each token is used only once during training. (The Wikipedia and Books data are the exception; they are seen roughly twice.)
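A toy sketch of such a tokenizer setup (the corpus file name and the 32k vocabulary size are illustrative assumptions; the BPE/SentencePiece choice, digit splitting, and byte fallback are from the paper):

    # Train and use a SentencePiece BPE tokenizer with digit splitting and byte fallback.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="corpus.txt",                 # placeholder corpus
        model_prefix="llama_like_tokenizer",
        model_type="bpe",
        vocab_size=32000,                   # assumed vocabulary size
        split_digits=True,                  # numbers are split into individual digits
        byte_fallback=True,                 # unknown UTF-8 characters decompose into bytes
    )

    sp = spm.SentencePieceProcessor(model_file="llama_like_tokenizer.model")
    print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))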
Model configurations that LLaMA used
  • Deviations from the original Transformer (each is sketched at the end of this section):
    • Pre-normalization (from GPT-3): each input to the transformer layer is normalized by RMSNorm, instead of normalizing the output of the transformer layer.
    • SwiGLU activation function (from PaLM) is used instead of ReLU.
    • Rotary Embeddings (from GPT-Neo) are used instead of absolute positional embeddings.
  • Other training options: AdamW optimizer, cosine learning rate schedule, weight decay, gradient clipping, warmup steps, and learning rate and batch size that vary with the size of the model. (See Table 2 of the paper; an optimizer sketch appears at the end of this section.)
  • Implementation optimizations
    • Efficient implementation of causal multi-head attention from the xformers library (https://github.com/facebookresearch/xformers), sketched at the end of this section.
    • Activation checkpointing
      • Backward functions of the transformer layers are implemented manually, instead of relying on PyTorch autograd, so that activations that are expensive to compute can be saved rather than recomputed.
    • Model and sequence parallelism from Korthikanti et al. (https://arxiv.org/abs/2205.05198, the Megatron-LM line of work)
    • Computation/communication overlap: the computation of activations is overlapped with the communication between GPUs over the network.
    • On the 65B model, they achieved ~380 tokens/sec/GPU on 2048 A100-80GB GPUs, so training on 1.4T tokens took about 21 days. (A quick arithmetic check closes this section.)
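A minimal sketch of RMSNorm-based pre-normalization, assuming a PyTorch-style layout (the eps value and module structure are illustrative, not the exact LLaMA code):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """Rescale by the root-mean-square of the features; no mean subtraction or bias."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return self.weight * x * rms

    def pre_norm_block(x, attn, ffn, norm1, norm2):
        # Pre-normalization: normalize the *input* of each sub-layer and keep the
        # residual path untouched, instead of normalizing the sub-layer output.
        x = x + attn(norm1(x))
        x = x + ffn(norm2(x))
        return x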
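A sketch of a SwiGLU feed-forward layer; the 2/3 * 4d hidden dimension is the value reported in the paper, while the bias-free linear layers are an assumption:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFeedForward(nn.Module):
        """Feed-forward block with a SwiGLU gate instead of a plain ReLU MLP."""
        def __init__(self, dim: int):
            super().__init__()
            hidden = int(2 / 3 * 4 * dim)   # 2/3 * 4d hidden size
            self.w_gate = nn.Linear(dim, hidden, bias=False)
            self.w_up = nn.Linear(dim, hidden, bias=False)
            self.w_down = nn.Linear(hidden, dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # SwiGLU(x) = (Swish(x W_gate) * (x W_up)) W_down, with Swish = SiLU.
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))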
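A sketch of rotary position embeddings; the channel pairing below (first half with second half) is one common convention and may differ from the exact LLaMA layout:

    import torch

    def apply_rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """Rotate channel pairs of x (batch, seq_len, n_heads, head_dim) by position-dependent angles."""
        _, seq_len, _, head_dim = x.shape
        half = head_dim // 2
        # One frequency per channel pair, one angle per (position, pair).
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
        cos = angles.cos()[None, :, None, :]
        sin = angles.sin()[None, :, None, :]
        x1, x2 = x[..., :half], x[..., half:]
        # 2D rotation of each (x1, x2) pair; relative positions become phase differences
        # in the query-key dot products, so no absolute position embedding is added.
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    # Applied to queries and keys (not values) right before the attention dot product.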
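A sketch of the optimizer and learning rate schedule, assuming PyTorch; the numeric values are placeholders in the spirit of Table 2, not a copy of it:

    import math
    import torch

    max_lr, min_lr = 1.5e-4, 1.5e-5         # cosine decay down to a fraction of the peak LR
    warmup_steps, total_steps = 2000, 100_000
    weight_decay, clip_norm = 0.1, 1.0

    model = torch.nn.Linear(4096, 4096)     # stand-in for the actual transformer
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_at(step: int) -> float:
        if step < warmup_steps:             # linear warmup
            return max_lr * step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

    # Inside the training loop:
    #   for group in optimizer.param_groups: group["lr"] = lr_at(step)
    #   loss.backward()
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    #   optimizer.step(); optimizer.zero_grad()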
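A sketch of how the memory-efficient causal attention and activation checkpointing could be wired together, using the public xformers and PyTorch checkpoint APIs rather than LLaMA's manually written backward functions:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint
    from xformers.ops import memory_efficient_attention, LowerTriangularMask

    def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, seq_len, n_heads, head_dim). The fused kernel never
        # materializes the full seq_len x seq_len attention matrix, and the causal
        # mask skips scores that would be masked out anyway.
        return memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())

    def run_with_activation_checkpointing(x: torch.Tensor, layers: nn.ModuleList) -> torch.Tensor:
        # Store only each layer's input; recompute intermediate activations during backward.
        for layer in layers:
            x = checkpoint(layer, x, use_reentrant=False)
        return x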
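The ~21-day figure follows directly from the reported throughput:

    tokens = 1.4e12                      # total training tokens
    tokens_per_sec = 380 * 2048          # tokens/sec/GPU * number of GPUs
    days = tokens / tokens_per_sec / 86400
    print(f"{days:.1f} days")            # ~20.8 days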
