LLM Scaling Law
An LLM Scaling Law is a scaling law that can apply to a Large Language Model.
- Context:
- It can (often) involve relationships between Model Size, Training Dataset Size, Training Cost, and Post-Training Performance.
- It can be expressed with respect to Cross-Entropy Loss.
- It can include findings that the model size and the number of training tokens should be scaled equally for compute-optimal training.
- It can suggest that increasing model size shows diminishing returns and performance saturation, especially beyond 100 billion parameters.
- It can suggest that dataset size improvements also show diminishing benefits.
- It can suggest that optimal configurations balance model width, depth, batch size, and memory bandwidth depending on hardware.
- It can include guidelines for determining the optimal size of a model for a given quantity of compute (see the sketch after this list).
- It can explore the interplay between model size, training dataset size, and compute budget when training large language models, in order to find the most efficient balance.
- It can suggest that further research is required to understand the complex relationships between model scale, data scale, and model quality across different tasks.
- ...
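The following is a minimal sketch of such a compute-optimal sizing guideline, assuming the standard C ≈ 6·N·D FLOP approximation and the roughly 20-tokens-per-parameter ratio implied by Chinchilla (70B parameters, 1.4T tokens); the loss constants are the "Approach 3" fit published in Hoffmann et al., 2022, so treat the numbers as illustrative rather than definitive:

```python
import math

# Fitted constants from Hoffmann et al. (2022), "Approach 3":
#   L(N, D) = E + A / N^alpha + B / D^beta
E, A, B = 1.69, 406.4, 410.7   # irreducible loss, model-size term, data term
ALPHA, BETA = 0.34, 0.28       # fitted scaling exponents

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training cross-entropy loss L(N, D)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget C between parameters N and training tokens D.

    Uses C ~= 6 * N * D and the paper's headline result that N and D
    should grow in equal proportion (both roughly ~ C^0.5), anchored at
    the ~20-tokens-per-parameter ratio (so D = 20 * N and C = 120 * N^2).
    """
    n_opt = math.sqrt(flops / 120.0)
    d_opt = 20.0 * n_opt
    return n_opt, d_opt

if __name__ == "__main__":
    c = 5.76e23  # roughly Chinchilla's training budget in FLOPs
    n, d = compute_optimal(c)
    print(f"N ~= {n:.3g} params, D ~= {d:.3g} tokens, "
          f"predicted loss ~= {chinchilla_loss(n, d):.3f}")
```

Run with roughly Chinchilla's budget, this sketch recovers a model of about 70 billion parameters trained on about 1.4 trillion tokens, matching the paper's reported compute-optimal configuration.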
- Example(s):
- Hoffmann et al., 2022's study on training compute-optimal large language models.
- ...
- Counter-Example(s):
- a scaling law for a Small LM or a Medium-sized LM.
- ...
- See: Large Language Model, Model Performance, Cross-Entropy Loss, Neural Network, Chinchilla LLM.
References
2022
- (Hoffmann et al., 2022) ⇒ Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. (2022). “Training Compute-Optimal Large Language Models.” In: Advances in Neural Information Processing Systems, 35. doi:10.48550/arXiv.2203.15556
- QUOTE: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. ...