2020 The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- (Gao et al., 2020) ⇒ Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” In: arXiv preprint arXiv:2101.00027.
Subject Headings: The Pile Dataset.
Notes
Cited By
2022
- (Black et al., 2022) ⇒ Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He et al. (2022). “GPT-NeoX-20B: An Open-Source Autoregressive Language Model.” In: arXiv preprint arXiv:2204.06745.
- QUOTE: ... We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, ...
Quotes
Abstract
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
References
Author | Date Value | title | journal | year
---|---|---|---|---
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | 2020 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | arXiv preprint arXiv:2101.00027 | 2020