- https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/
- The main contribution of LLaMA is two-fold:
- The focus of LLaMA is to train language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used.
- Studies by Kaplan et al. (https://arxiv.org/abs/2001.08361) and Hoffmann et al. (https://arxiv.org/abs/2203.15556) explored the optimal ratio between model size and dataset size under a given compute budget.
- For example, given a compute budget of 10, model : dataset = sqrt(10) : sqrt(10) might be better than model : dataset = 5 : 2 (both splits cost the same, since compute scales roughly with the product of the two).
- However, a compute-optimally trained model is not necessarily the best one to serve, since only the model size matters at inference time.
- So it might be better to train a smaller model on a larger dataset if you plan to deploy the model at large scale; a rough sketch of the arithmetic follows.
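The approximation C ≈ 6ND below is the standard one from Kaplan et al., and the ≈0.5 exponents are Hoffmann et al.'s compute-optimal fit; treat both as those papers' estimates rather than anything LLaMA re-derives:

```latex
% Training compute is roughly proportional to parameters times tokens:
%   C \approx 6 N D   (N = model parameters, D = training tokens),
% so a fixed budget C only pins down the product N \cdot D:
C \approx 6\,N\,D \;\Longrightarrow\; N \cdot D \approx \tfrac{C}{6}.
% Hoffmann et al. estimate that the compute-optimal split grows both factors together,
%   N_{\mathrm{opt}} \propto C^{\approx 0.5}, \qquad D_{\mathrm{opt}} \propto C^{\approx 0.5},
% but inference cost depends only on N, so LLaMA deliberately chooses N below
% N_{\mathrm{opt}} and compensates with a larger D.
```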
- LLaMA uses only publicly available data, making the work compatible with open-sourcing. (YAYYY!)
- Previous work such as Chinchilla, PaLM, or GPT-3 relies on data that is either not publicly available or undocumented (e.g., Books-2TB).
- While previous works such as OPT, GPT-Neo, BLOOM, and GLM use public data, their performance is not competitive with PaLM or Chinchilla.
- LLaMA tokenizes text with the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece.
- Note that numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes.
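A minimal sketch of training such a tokenizer with the `sentencepiece` package. The 32k vocabulary matches the paper; the corpus path and the remaining trainer flags are placeholders, not the paper's actual settings:

```python
import sentencepiece as spm

# Train a BPE tokenizer in the LLaMA style:
#   - split_digits=True  -> numbers are broken into individual digits
#   - byte_fallback=True -> unknown UTF-8 characters decompose into raw bytes
# "corpus.txt" is a placeholder path, not the actual LLaMA training data.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="llama_bpe",
    vocab_size=32000,          # LLaMA uses a 32k vocabulary
    model_type="bpe",
    split_digits=True,
    byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```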
- The entire training dataset contains roughly 1.4T tokens, and each token is used only once during training (with the exception of the Wikipedia and Books domains, which are seen for approximately two epochs).
- Deviations from the original Transformer (see the combined sketch after this list)
- Pre-normalization (from GPT-3): the input of each transformer sub-layer is normalized with RMSNorm, instead of normalizing the output.
- SwiGLU activation function (from PaLM) is used instead of ReLU.
- Rotary embeddings (from GPT-Neo) are used instead of absolute positional embeddings.
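A minimal PyTorch sketch of a transformer block with these three changes. Dimensions, the feed-forward hidden size, and the exact rotary pairing convention are illustrative choices, not the released LLaMA code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (SiLU(x W1) * x W3) W2, replacing the ReLU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def rotary(x, theta: float = 10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    b, s, h, d = x.shape
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(torch.arange(s, dtype=torch.float32), freqs)   # (seq, d/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class Block(nn.Module):
    """Pre-norm transformer block: normalize the *input* of each sub-layer."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(dim, dim, bias=False) for _ in range(4))
        self.ffn = SwiGLU(dim, hidden=4 * dim)   # hidden size here is illustrative

    def forward(self, x):
        b, s, _ = x.shape
        h = self.attn_norm(x)                                       # pre-normalization
        q, k, v = (w(h).view(b, s, self.heads, self.head_dim) for w in (self.wq, self.wk, self.wv))
        q, k = rotary(q), rotary(k)                                 # rotary embeddings on q, k
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))            # (b, heads, s, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, s, -1))
        return x + self.ffn(self.ffn_norm(x))                       # pre-norm before the MLP too

x = torch.randn(2, 16, 512)
print(Block()(x).shape)  # torch.Size([2, 16, 512])
```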
- Other options: AdamW optimizer, cosine learning rate schedule, weight decay, gradient clipping, warmup steps, and learning rate / batch size varying with model size. (See Table 2.)
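A minimal PyTorch sketch of that setup. The betas, weight decay, clipping threshold, and 2000 warmup steps are the paper's reported settings; the peak learning rate and step count below only roughly correspond to the 65B configuration, and the exact way warmup and cosine decay are combined is an assumption:

```python
import math
import torch

def build_optimizer(model, max_lr=1.5e-4, warmup=2000, total_steps=350_000):
    """AdamW + linear warmup + cosine decay down to 10% of the peak LR.
    max_lr and total_steps are placeholders loosely matching the 65B setup."""
    opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                            betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup:                              # linear warmup
            return step / max(1, warmup)
        t = min(1.0, (step - warmup) / max(1, total_steps - warmup))
        return 0.1 + 0.45 * (1 + math.cos(math.pi * t))  # cosine from 1.0 down to 0.1

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Inside the training loop, gradients are also clipped before each step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   opt.step(); sched.step(); opt.zero_grad()
```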
- Implementation optimizations (a sketch of the first two follows this list)
- Efficient implementation of the causal multi-head attention from the xformers library (https://github.com/facebookresearch/xformers)
- Activation checkpointing
- Manually implemented backward functions for the transformer layers, instead of relying on PyTorch autograd
- Model and sequence parallelism from Korthikanti et al. (https://arxiv.org/abs/2205.05198, i.e., Megatron-LM)
- Overlapping the computation of activations with the communication between GPUs
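A minimal sketch of the first two items, using the public xformers and PyTorch checkpointing APIs. Wiring them together like this is an assumption, not the released LLaMA training code, and the xformers kernel expects CUDA tensors in practice:

```python
import torch
import xformers.ops as xops
from torch.utils.checkpoint import checkpoint

def causal_attention(q, k, v):
    """Memory-efficient causal attention from xformers.
    Expects (batch, seq, heads, head_dim) tensors; avoids materializing the full
    seq x seq attention matrix and the masked-out upper triangle."""
    return xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

def run_layer(layer, x):
    """Activation checkpointing: drop this layer's intermediate activations during
    the forward pass and recompute them in backward, trading compute for memory."""
    return checkpoint(layer, x, use_reentrant=False)

# Usage sketch (hypothetical `layers` list of transformer blocks):
#   for layer in layers:
#       x = run_layer(layer, x)
```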
- For the 65B model, they achieved 380 tokens/sec/GPU on 2048 A100-80GB GPUs; at that rate, training on 1.4T tokens takes ~21 days.
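A quick back-of-the-envelope check of that estimate:

```python
# 2048 GPUs at 380 tokens/sec/GPU, 1.4T tokens total
seconds = 1.4e12 / (2048 * 380)
print(seconds / 86_400)  # ~20.8 days
```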