- https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/
- The main contribution of LLaMA is two-fold:
- The focus of LLaMA is to train language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used.
- Studies by Kaplan et al. (https://arxiv.org/abs/2001.08361) and Hoffmann et al. (https://arxiv.org/abs/2203.15556) explored the optimal ratio between model size and dataset size under a given compute budget.
- For example, given a compute budget of 10, model : dataset = sqrt(10) : sqrt(10) might be better than model : dataset = 5 : 2 (both splits cost the same, since compute scales roughly with the product of the two).
- However, a compute-optimally trained model is not necessarily the best one to serve, since only the model size matters at inference time.
- So it might be better to train a smaller model on a larger dataset if you plan to deploy the model at large scale; a rough sketch of the arithmetic follows.
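The approximation C ≈ 6ND below is the standard one from Kaplan et al., and the ≈0.5 exponents are Hoffmann et al.'s compute-optimal fit; treat both as those papers' estimates rather than anything LLaMA re-derives:

```latex
% Training compute is roughly proportional to parameters times tokens:
%   C \approx 6 N D   (N = model parameters, D = training tokens),
% so a fixed budget C only pins down the product N \cdot D:
C \approx 6\,N\,D \;\Longrightarrow\; N \cdot D \approx \tfrac{C}{6}.
% Hoffmann et al. estimate that the compute-optimal split grows both factors together,
%   N_{\mathrm{opt}} \propto C^{\approx 0.5}, \qquad D_{\mathrm{opt}} \propto C^{\approx 0.5},
% but inference cost depends only on N, so LLaMA deliberately chooses N below
% N_{\mathrm{opt}} and compensates with a larger D.
```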
- LLaMA uses only publicly available data, making the work compatible with open-sourcing. (YAYYY!)
- Previous work such as Chinchilla, PaLM, or GPT-3 relies on data that is either not publicly available or undocumented (e.g., Books-2TB).
- While previous works such as OPT, GPT-Neo, BLOOM, and GLM use public data, their performance is not competitive with PaLM or Chinchilla.
- LLaMA tokenizes text with the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece.
- Note that numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes.
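A minimal sketch of training such a tokenizer with the `sentencepiece` package. The 32k vocabulary matches the paper; the corpus path and the remaining trainer flags are placeholders, not the paper's actual settings:

```python
import sentencepiece as spm

# Train a BPE tokenizer in the LLaMA style:
#   - split_digits=True  -> numbers are broken into individual digits
#   - byte_fallback=True -> unknown UTF-8 characters decompose into raw bytes
# "corpus.txt" is a placeholder path, not the actual LLaMA training data.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="llama_bpe",
    vocab_size=32000,          # LLaMA uses a 32k vocabulary
    model_type="bpe",
    split_digits=True,
    byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```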
- The entire training dataset contains roughly 1.4T tokens, and each token is used only once during training (with the exception of the Wikipedia and Books domains, which are seen for approximately two epochs).
- Deviations from the original Transformer (see the combined sketch after this list)
- Pre-normalization (from GPT-3): the input of each transformer sub-layer is normalized with RMSNorm, instead of normalizing the output.
- SwiGLU activation function (from PaLM) is used instead of ReLU.
- Rotary embeddings (from GPT-Neo) are used instead of absolute positional embeddings.
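A minimal PyTorch sketch of a transformer block with these three changes. Dimensions, the feed-forward hidden size, and the exact rotary pairing convention are illustrative choices, not the released LLaMA code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (SiLU(x W1) * x W3) W2, replacing the ReLU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def rotary(x, theta: float = 10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    b, s, h, d = x.shape
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(torch.arange(s, dtype=torch.float32), freqs)   # (seq, d/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class Block(nn.Module):
    """Pre-norm transformer block: normalize the *input* of each sub-layer."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(dim, dim, bias=False) for _ in range(4))
        self.ffn = SwiGLU(dim, hidden=4 * dim)   # hidden size here is illustrative

    def forward(self, x):
        b, s, _ = x.shape
        h = self.attn_norm(x)                                       # pre-normalization
        q, k, v = (w(h).view(b, s, self.heads, self.head_dim) for w in (self.wq, self.wk, self.wv))
        q, k = rotary(q), rotary(k)                                 # rotary embeddings on q, k
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))            # (b, heads, s, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, s, -1))
        return x + self.ffn(self.ffn_norm(x))                       # pre-norm before the MLP too

x = torch.randn(2, 16, 512)
print(Block()(x).shape)  # torch.Size([2, 16, 512])
```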
- Other options: AdamW optimizer, cosine learning rate schedule, weight decay, gradient clipping, warmup steps, and learning rate / batch size varying with model size. (See Table 2.)
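A minimal PyTorch sketch of that setup. The betas, weight decay, clipping threshold, and 2000 warmup steps are the paper's reported settings; the peak learning rate and step count below only roughly correspond to the 65B configuration, and the exact way warmup and cosine decay are combined is an assumption:

```python
import math
import torch

def build_optimizer(model, max_lr=1.5e-4, warmup=2000, total_steps=350_000):
    """AdamW + linear warmup + cosine decay down to 10% of the peak LR.
    max_lr and total_steps are placeholders loosely matching the 65B setup."""
    opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                            betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup:                              # linear warmup
            return step / max(1, warmup)
        t = min(1.0, (step - warmup) / max(1, total_steps - warmup))
        return 0.1 + 0.45 * (1 + math.cos(math.pi * t))  # cosine from 1.0 down to 0.1

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Inside the training loop, gradients are also clipped before each step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   opt.step(); sched.step(); opt.zero_grad()
```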
- Implementation optimizations (a sketch of the first two follows this list)
- Efficient implementation of the causal multi-head attention from the xformers library (https://github.com/facebookresearch/xformers)
- Activation checkpointing
- Manually implemented backward functions for the transformer layers, instead of relying on PyTorch autograd
- Model and sequence parallelism from Korthikanti et al. (https://arxiv.org/abs/2205.05198, i.e., Megatron-LM)
- Overlapping the computation of activations with the communication between GPUs
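A minimal sketch of the first two items, using the public xformers and PyTorch checkpointing APIs. Wiring them together like this is an assumption, not the released LLaMA training code, and the xformers kernel expects CUDA tensors in practice:

```python
import torch
import xformers.ops as xops
from torch.utils.checkpoint import checkpoint

def causal_attention(q, k, v):
    """Memory-efficient causal attention from xformers.
    Expects (batch, seq, heads, head_dim) tensors; avoids materializing the full
    seq x seq attention matrix and the masked-out upper triangle."""
    return xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

def run_layer(layer, x):
    """Activation checkpointing: drop this layer's intermediate activations during
    the forward pass and recompute them in backward, trading compute for memory."""
    return checkpoint(layer, x, use_reentrant=False)

# Usage sketch (hypothetical `layers` list of transformer blocks):
#   for layer in layers:
#       x = run_layer(layer, x)
```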
- For the 65B model, they achieved 380 tokens/sec/GPU on 2048 A100-80GB GPUs; at that rate, training on 1.4T tokens takes ~21 days.
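A quick back-of-the-envelope check of that estimate:

```python
# 2048 GPUs at 380 tokens/sec/GPU, 1.4T tokens total
seconds = 1.4e12 / (2048 * 380)
print(seconds / 86_400)  # ~20.8 days
```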