LLM Scaling Law
An LLM Scaling Law is a scaling law that can apply to a Large Language Model.
- Context:
- It can (often) involve relationships between Model Size, Training Dataset Size, Training Cost, and Post-Training Performance.
- It can be expressed with respect to Cross-Entropy Loss.
- It can include findings that the model size and the number of training tokens should be scaled equally for compute-optimal training.
- It can suggest that increasing model size shows diminishing returns and performance saturation, especially beyond 100 billion parameters.
- It can suggest that dataset size improvements also show diminishing benefits.
- It can suggest that optimal configurations balance model width, depth, batch size, and memory bandwidth depending on hardware.
- It can include guidelines for determining the optimal size of a model for a given quantity of compute (see the sketch after this list).
- It can explore the interplay between model size, training dataset size, and compute budget when training large language models, in order to find the most efficient balance.
- It can suggest that further research is required to understand the complex relationships between model scale, data scale, and model quality across different tasks.
- ...
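The following is a minimal sketch of such a compute-optimal sizing guideline, assuming the standard C ≈ 6·N·D FLOP approximation and the roughly 20-tokens-per-parameter ratio implied by Chinchilla (70B parameters, 1.4T tokens); the loss constants are the "Approach 3" fit published in Hoffmann et al., 2022, so treat the numbers as illustrative rather than definitive:

```python
import math

# Fitted constants from Hoffmann et al. (2022), "Approach 3":
#   L(N, D) = E + A / N^alpha + B / D^beta
E, A, B = 1.69, 406.4, 410.7   # irreducible loss, model-size term, data term
ALPHA, BETA = 0.34, 0.28       # fitted scaling exponents

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training cross-entropy loss L(N, D)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget C between parameters N and training tokens D.

    Uses C ~= 6 * N * D and the paper's headline result that N and D
    should grow in equal proportion (both roughly ~ C^0.5), anchored at
    the ~20-tokens-per-parameter ratio (so D = 20 * N and C = 120 * N^2).
    """
    n_opt = math.sqrt(flops / 120.0)
    d_opt = 20.0 * n_opt
    return n_opt, d_opt

if __name__ == "__main__":
    c = 5.76e23  # roughly Chinchilla's training budget in FLOPs
    n, d = compute_optimal(c)
    print(f"N ~= {n:.3g} params, D ~= {d:.3g} tokens, "
          f"predicted loss ~= {chinchilla_loss(n, d):.3f}")
```

Run with roughly Chinchilla's budget, this sketch recovers a model of about 70 billion parameters trained on about 1.4 trillion tokens, matching the paper's reported compute-optimal configuration.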
- Example(s):
- Hoffmann et al., 2022's study on training compute-optimal large language models.
- ...
- Counter-Example(s):
- a scaling law for a Small LM or a Medium-sized LM.
- ...
- See: Large Language Model, Model Performance, Cross-Entropy Loss, Neural Network, Chinchilla LLM.
References
2022
- (Hoffmann et al., 2022) ⇒ Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. (2022). “Training Compute-Optimal Large Language Models.” In: Advances in Neural Information Processing Systems, 35. doi:10.48550/arXiv.2203.15556
- QUOTE: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. ...