Switch Transformer Architecture


A Switch Transformer Architecture is an MoE transformer network architecture that routes each input token to a single expert sub-network (top-1 routing), so that only a small fraction of its parameters is active for any given input.

  • Context:
    • It can (typically) select different parameters (experts) for processing different inputs via a learned routing function (see the routing sketch after this list).
    • It can facilitate scaling Deep Learning Networks to trillions of parameters while keeping the computation per input roughly constant.
    • It can aim to improve serving efficiency by distilling sparse pre-trained or fine-tuned models into smaller, dense models (see the distillation sketch after this list).
    • It can, through such distillation, reduce model size by up to 99% while preserving roughly 30% of the quality gains of the larger, sparse teacher model.
    • It can employ selective precision training, e.g., computing the router in float32 while keeping the rest of the model in bfloat16, which improves training stability at little cost in speed.
    • ...
  • Example(s):
    • the Switch-C model, a Switch Transformer scaled to roughly 1.6 trillion parameters (Fedus et al., 2021).
    • ...
  • Counter-Example(s):
    • Traditional Transformer models that do not utilize the Mixture of Experts approach.
    • Smaller-scale AI models with a fixed set of parameters for all inputs.
  • See: Mixture of Experts, Language Model, Sparsely-Gated MoE.
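
As an illustrative sketch of the routing behavior described in the Context above, the following minimal PyTorch layer performs top-1 ("switch") routing over a set of expert feed-forward blocks and computes an auxiliary load-balancing loss. Class and parameter names (SwitchFFN, num_experts, d_ff) are assumptions of this example, and details such as expert capacity, token dropping, and model parallelism are omitted; this is not the authors' reference implementation.

```python
# Minimal, illustrative sketch of a Switch-style feed-forward layer
# (top-1 routing plus an auxiliary load-balancing loss). Names such as
# SwitchFFN, num_experts, and d_ff are assumptions for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.num_experts = num_experts
        # Router: one linear projection producing a logit per expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an independent position-wise feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model); a full model would flatten (batch, seq, d_model).
        # Selective precision: compute router probabilities in float32 for stability.
        logits = self.router(x).float()
        probs = F.softmax(logits, dim=-1)
        gate, expert_index = probs.max(dim=-1)  # top-1 ("switch") routing

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_index == e
            if mask.any():
                # Scale each expert's output by its (differentiable) gate value.
                out[mask] = gate[mask, None].to(x.dtype) * expert(x[mask])

        # Auxiliary load-balancing loss: encourages a uniform token-to-expert split.
        tokens_per_expert = torch.zeros(self.num_experts, dtype=probs.dtype, device=x.device)
        tokens_per_expert.scatter_add_(0, expert_index, torch.ones_like(gate))
        fraction_tokens = tokens_per_expert / x.shape[0]
        fraction_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(fraction_tokens * fraction_probs)
        return out, aux_loss


# Usage: route 8 tokens of width 16 through 4 experts.
layer = SwitchFFN(d_model=16, d_ff=32, num_experts=4)
y, aux = layer(torch.randn(8, 16))
```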

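As a similarly hedged sketch of the distillation path mentioned in the Context above, the snippet below trains a small dense student against a larger (possibly sparse) teacher by mixing hard-label cross-entropy with a soft-target KL term. The function name distill_step, the toy stand-in models, and the 0.25 soft-target weight are assumptions for illustration, not the exact recipe used for Switch Transformers.

```python
# Minimal, illustrative distillation step: a large (sparse) teacher's output
# distribution supervises a small dense student. The 0.75/0.25 mix of hard
# and soft targets is an assumption for this sketch.
import torch
import torch.nn.functional as F


def distill_step(student, teacher, optimizer, tokens, labels, soft_weight=0.25):
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(tokens)        # frozen teacher (e.g., a sparse MoE model)

    student_logits = student(tokens)            # small dense student
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss = (1.0 - soft_weight) * hard_loss + soft_weight * soft_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage with toy stand-in models (any callables producing logits work here):
teacher = torch.nn.Linear(16, 10)
student = torch.nn.Linear(16, 10)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)
loss = distill_step(student, teacher, opt, torch.randn(8, 16), torch.randint(0, 10, (8,)))
```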

References

2021

  • (Fedus et al., 2021) ⇒ William Fedus, Barret Zoph, and Noam Shazeer. (2021). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” In: arXiv preprint arXiv:2101.03961.