AI2 Dolma (2023)


An AI2 Dolma (2023) is a very-large text corpus dataset produced by AI2.



References

2023

  • https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64
    • NOTES:
      • Dolma is a 3-trillion-token open dataset released by the Allen Institute for AI (AI2) in August 2023.
      • It contains text from diverse sources including the web, academic publications, code, books, and encyclopedias.
      • The goal is to use Dolma to train the Allen Institute's open language model called OLMo.
      • Dolma aims to be the largest open dataset for language model pretraining to date.
      • It was created with principles of openness, representativeness, size, reproducibility, and risk mitigation in mind.
      • The data went through source-specific and general processing, including deduplication, English-only filtering, and risk mitigation; a small fraction of code was also mixed in.
      • Dolma is released under AI2's ImpACT license, which requires stating intended use cases and disclosing derivatives.
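
The processing steps named in the notes above (English-only filtering and deduplication) can be illustrated with a toy sketch. This is not AI2's pipeline or toolkit: the stopword heuristic, the `min_ratio` threshold, and the function names `looks_english`, `normalize`, and `filter_and_deduplicate` are all hypothetical, chosen only to show the general shape of such a filter. A real pipeline would likely use a trained language identifier and fuzzy deduplication rather than exact hashing.

```python
import hashlib
import re

# Hypothetical stopword list: a crude stand-in for a real language identifier.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"}


def looks_english(text, min_ratio=0.1):
    """Toy heuristic: True if enough tokens are common English words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens) >= min_ratio


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def filter_and_deduplicate(documents):
    """Keep English-looking documents, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for doc in documents:
        if not looks_english(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The model is trained on text from the web.",
    "the model is  trained on text from the web.",  # duplicate after normalization
    "El modelo se entrena con texto de la web.",    # rejected by the English heuristic
    "Code and books are part of the corpus.",
]
kept = filter_and_deduplicate(docs)
```

With this sample input, `kept` holds the first and last documents: the second is an exact duplicate once case and whitespace are normalized, and the third contains none of the listed stopwords.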