2021 MT5AMassivelyMultilingualPreTrainedTextToTextTransformer


Subject Headings: mT5 LLM, Multilingual LLM, mC4 Corpus.

Notes

  • The paper introduces mT5, a multilingual variant of the T5 model, pre-trained on mC4, a newly developed Common Crawl-based dataset covering 101 languages.
  • The objective behind mT5 is to offer a massively multilingual model that closely follows the original T5 design, thereby inheriting all its advantages, such as its general-purpose text-to-text format, insights from empirical studies, and scalability.
  • The mC4 dataset is an extended version of the C4 dataset, designed to include natural text across 101 languages, derived from the public Common Crawl web scrape, with modifications made to ensure data quality and relevance.
  • mT5's performance was validated on multiple multilingual benchmarks, where it demonstrated state-of-the-art results in many cases. The paper also explores the issue of "accidental translation" during zero-shot tasks and proposes a simple technique to mitigate it.
  • mT5 casts all NLP tasks in a text-to-text format and uses an encoder-decoder Transformer architecture. It was pre-trained with a masked language modeling "span-corruption" objective, in which consecutive spans of input tokens are replaced with sentinel tokens and the model learns to reconstruct the dropped-out spans (see the first sketch after this list).
  • The paper discusses the importance of the sampling strategy when training on data from many languages. To boost lower-resource languages, examples are sampled with probability proportional to the size of each language's dataset raised to the power of a hyperparameter α, with α = 0.3 used in the final models (see the second sketch after this list).
  • The mT5 vocabulary size was increased to 250,000 wordpieces to accommodate over 100 languages, using SentencePiece with high character coverage for languages with large character sets and the "byte-fallback" feature, so that characters outside the vocabulary decompose into UTF-8 bytes (see the third sketch after this list).
  • The authors compare mT5 to other massively multilingual pre-trained language models, highlighting its unique position in terms of architecture, parameter count, language coverage, and data source.
  • The paper details experiments on the XTREME multilingual benchmark, showing that mT5 models of various sizes exceed or approach state-of-the-art performance across different NLP tasks and languages, underscoring the benefits of scaling up a simple pre-training recipe.
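
The span-corruption objective can be illustrated with a short, simplified sketch. This is not the authors' preprocessing code: the span-selection heuristic here is a rough approximation, and only the <extra_id_N> sentinel naming follows the T5 convention.

    import random

    def span_corrupt(tokens, noise_density=0.15, mean_span_length=3, seed=0):
        """T5-style span corruption: drop contiguous spans, replacing each with a sentinel.

        Returns (inputs, targets): inputs keep the surviving tokens with one sentinel
        per dropped span; targets list each sentinel followed by the tokens it hides.
        """
        rng = random.Random(seed)
        n_tokens = len(tokens)
        n_noise = max(1, round(n_tokens * noise_density))
        n_spans = max(1, round(n_noise / mean_span_length))

        # Mark roughly n_noise tokens as masked, grouped into n_spans contiguous spans.
        masked = [False] * n_tokens
        for start in sorted(rng.sample(range(n_tokens), n_spans)):
            for i in range(start, min(n_tokens, start + mean_span_length)):
                masked[i] = True

        inputs, targets, sentinel = [], [], 0
        i = 0
        while i < n_tokens:
            if masked[i]:
                inputs.append(f"<extra_id_{sentinel}>")   # one sentinel stands in for the span
                targets.append(f"<extra_id_{sentinel}>")
                while i < n_tokens and masked[i]:
                    targets.append(tokens[i])             # dropped tokens become the prediction target
                    i += 1
                sentinel += 1
            else:
                inputs.append(tokens[i])
                i += 1
        return inputs, targets

    # Example:
    # inputs, targets = span_corrupt("Thank you for inviting me to your party last week".split())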
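
The language-sampling rule amounts to exponentiating the per-language example counts and renormalizing. The corpus sizes in this sketch are hypothetical, not figures from the paper.

    def sampling_probs(num_examples, alpha=0.3):
        """p(L) ∝ |D_L| ** alpha: exponentiate per-language counts, then normalize.

        alpha = 1 reproduces the raw data distribution; alpha near 0 approaches
        uniform sampling, so smaller alpha boosts lower-resource languages.
        """
        weights = {lang: count ** alpha for lang, count in num_examples.items()}
        total = sum(weights.values())
        return {lang: w / total for lang, w in weights.items()}

    # Hypothetical corpus sizes (numbers of examples, not taken from the paper):
    corpus = {"en": 1_000_000_000, "sw": 1_000_000}
    print(sampling_probs(corpus, alpha=1.0))  # ≈ {'en': 0.999, 'sw': 0.001}
    print(sampling_probs(corpus, alpha=0.3))  # ≈ {'en': 0.888, 'sw': 0.112}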
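
For the vocabulary, a minimal SentencePiece training call with byte fallback enabled might look as follows. The file names are placeholders, the input corpus must be large and diverse enough to support a 250,000-piece vocabulary, and the exact settings are illustrative rather than the authors' configuration.

    import sentencepiece as spm

    # Train a unigram SentencePiece model with byte fallback, so characters never
    # seen during training decompose into UTF-8 byte pieces instead of <unk>.
    spm.SentencePieceTrainer.train(
        input="multilingual_sample.txt",  # placeholder: text sampled across languages
        model_prefix="mt5_like_vocab",    # placeholder output prefix
        vocab_size=250_000,
        model_type="unigram",
        character_coverage=0.99999,       # keep rare characters from large scripts
        byte_fallback=True,
    )

    sp = spm.SentencePieceProcessor(model_file="mt5_like_vocab.model")
    # Rare or unseen symbols come back as byte pieces such as <0xF0>.
    print(sp.encode("こんにちは 🌍", out_type=str))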

Cited By

Quotes

Abstract

The recent " Text-to-Text Transfer Transformer " (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent " accidental translation " in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

References

  • Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer." doi:10.48550/arXiv.2010.11934