2020 UnsupervisedCrossLingualReprese


Subject Headings: XLM Model, Transformer-based Masked Language Model, XLM-RoBERTa (XLM-R), Multilingual BERT (mBERT).

Notes

Cited By

Quotes

Abstract

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.
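Outside the paper itself, the released checkpoints are commonly loaded through the Hugging Face transformers library under the names xlm-roberta-base and xlm-roberta-large. A minimal sketch of masked-token prediction with one of them, assuming that library and PyTorch are installed:

```python
# Minimal sketch: scoring a masked token with a released XLM-R checkpoint.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# "xlm-roberta-base" is the public checkpoint name, not something defined in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# The same model and vocabulary cover all 100 pretraining languages.
text = f"Bonjour, je m'appelle {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top predictions for the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```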

1 Introduction

The goal of this paper is to improve cross-lingual language understanding (XLU), by carefully studying the effects of training unsupervised cross-lingual representations at a very large scale. We present XLM-R, a transformer-based multilingual masked language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering.

Multilingual masked language models (MLM) like mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) have pushed the state-of-the-art on cross-lingual understanding tasks by jointly pretraining large Transformer models (Vaswani et al., 2017) on many languages. These models allow for effective cross-lingual transfer, as seen in a number of benchmarks including cross-lingual natural language inference (Bowman et al., 2015; Williams et al., 2017; Conneau et al., 2018), question answering (Rajpurkar et al., 2016; Lewis et al., 2019), and named entity recognition (Pires et al., 2019; Wu and Dredze, 2019). However, all of these studies pre-train on Wikipedia, which provides a relatively limited scale especially for lower resource languages.

In this paper, we first present a comprehensive analysis of the trade-offs and limitations of multilingual language models at scale, inspired by recent monolingual scaling efforts (Liu et al., 2019). We measure the trade-off between high-resource and low-resource languages and the impact of language sampling and vocabulary size. The experiments expose a trade-off as we scale the number of languages for a fixed model capacity: more languages leads to better cross-lingual performance on low-resource languages up until a point, after which the overall performance on monolingual and cross-lingual benchmarks degrades. We refer to this trade-off as the curse of multilinguality, and show that it can be alleviated by simply increasing model capacity. We argue, however, that this remains an important limitation for future XLU systems which may aim to improve performance with more modest computational budgets.
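The language sampling mentioned above is XLM's exponentially smoothed distribution over languages: a language whose monolingual corpus makes up a fraction p_i of the data is drawn with probability q_i = p_i^α / Σ_j p_j^α, where α < 1 upsamples low-resource languages (the paper settles on α = 0.3 for XLM-R). A minimal sketch of that rebalancing, with made-up corpus sizes:

```python
# Sketch of exponentially smoothed language sampling (as in XLM / XLM-R):
# q_i = p_i**alpha / sum_j(p_j**alpha), where p_i is language i's share of the
# corpus; alpha < 1 upsamples low-resource languages (the paper uses alpha = 0.3).
def language_sampling_probs(token_counts, alpha=0.3):
    total = sum(token_counts.values())
    p = {lang: n / total for lang, n in token_counts.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# Made-up corpus sizes: smoothing shifts sampling mass toward Swahili and Urdu.
print(language_sampling_probs({"en": 300_000, "fr": 60_000, "sw": 1_000, "ur": 2_000}))
```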

Our best model XLM-RoBERTa (XLM-R) outperforms mBERT on cross-lingual classification by up to 23% accuracy on low-resource languages. It outperforms the previous state of the art by 5.1% average accuracy on XNLI, 2.42% average F1-score on Named Entity Recognition, and 9.1% average F1-score on cross-lingual Question Answering. We also evaluate monolingual fine-tuning on the GLUE and XNLI benchmarks, where XLM-R obtains results competitive with state-of-the-art monolingual models, including RoBERTa (Liu et al., 2019). These results demonstrate, for the first time, that it is possible to have a single large model for all languages, without sacrificing per-language performance. We will make our code, models and data publicly available, with the hope that this will help research in multilingual NLP and low-resource language understanding.
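The XNLI gains reported here are measured in the standard zero-shot cross-lingual transfer setting: the model is fine-tuned on English NLI data and the same weights are then evaluated on the other languages. A hypothetical skeleton of that setup using the public checkpoints (not the paper's training code; the Swahili sentence pair is an invented illustration):

```python
# Hypothetical skeleton of zero-shot cross-lingual classification with XLM-R:
# fine-tune a 3-way NLI head on English data, then evaluate the same weights on
# other XNLI languages. Dataset loading and the training loop are assumed/omitted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3  # entailment / neutral / contradiction
)

def encode(premise: str, hypothesis: str):
    # XNLI inputs are premise/hypothesis pairs; the tokenizer builds the pair encoding.
    return tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")

# ... fine-tune `model` on English NLI pairs with any standard training loop ...

# Zero-shot evaluation: the fine-tuned weights score a Swahili pair directly.
model.eval()
with torch.no_grad():
    logits = model(**encode("Mwanamke anapika chakula.", "Mtu anaandaa mlo.")).logits
print(logits.softmax(dim=-1))
```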

2 Related Work

From pretrained word embeddings (Mikolov et al., 2013b; Pennington et al., 2014) to pretrained contextualized representations (Peters et al., 2018; Schuster et al., 2019) and transformer based language models (Radford et al., 2018; Devlin et al., 2018), unsupervised representation learning has significantly improved the state of the art in natural language understanding. Parallel work on cross-lingual understanding (Mikolov et al., 2013a; Schuster et al., 2019; Lample and Conneau, 2019) extends these systems to more languages and to the cross-lingual setting in which a model is learned in one language and applied in other languages.

Most recently, Devlin et al. (2018) and Lample and Conneau (2019) introduced mBERT and XLM, masked language models trained on multiple languages without any cross-lingual supervision. Lample and Conneau (2019) propose translation language modeling (TLM) as a way to leverage parallel data and obtain a new state of the art on the cross-lingual natural language inference (XNLI) benchmark (Conneau et al., 2018). They further show strong improvements on unsupervised machine translation and pretraining for sequence generation. Wu et al. (2019) show that monolingual BERT representations are similar across languages, explaining in part the natural emergence of multilinguality in bottleneck architectures. Separately, Pires et al. (2019) demonstrated the effectiveness of multilingual models like mBERT on sequence labeling tasks. Huang et al. (2019) showed gains over XLM using cross-lingual multi-task learning, and Singh et al. (2019) demonstrated the efficiency of cross-lingual data augmentation for cross-lingual NLI. However, all of this work was at a relatively modest scale, in terms of the amount of training data, as compared to our approach.

The benefits of scaling language model pretraining by increasing the size of the model as well as the training data have been extensively studied in the literature. For the monolingual case, Jozefowicz et al. (2016) show how large-scale LSTM models can obtain much stronger performance on language modeling benchmarks when trained on billions of tokens. GPT (Radford et al., 2018) also highlights the importance of scaling the amount of data, and RoBERTa (Liu et al., 2019) shows that training BERT longer on more data leads to a significant boost in performance. Inspired by RoBERTa, we show that mBERT and XLM are undertuned, and that simple improvements in the learning procedure of unsupervised MLM lead to much better performance. We train on cleaned CommonCrawl data (Wenzek et al., 2019), which increases the amount of data for low-resource languages by two orders of magnitude on average. Similar data has also been shown to be effective for learning high-quality word embeddings in multiple languages (Grave et al., 2018).

Several efforts have trained massively multilingual machine translation models from large parallel corpora. They uncover the high and low resource trade-off and the problem of capacity dilution (Johnson et al., 2017; Tan et al., 2019). The work most similar to ours is Arivazhagan et al. (2019), which trains a single model in 103 languages on over 25 billion parallel sentences. Siddhant et al. (2019) further analyze the representations obtained by the encoder of a massively multilingual machine translation system and show that it obtains similar results to mBERT on cross-lingual NLI. Our work, in contrast, focuses on the unsupervised learning of cross-lingual representations and their transfer to discriminative tasks.

3 Model and Data

In this section, we present the training objective, languages, and data we use. We follow the XLM approach (Lample and Conneau, 2019) as closely as possible, only introducing changes that improve performance at scale.
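Concretely, the objective is XLM's multilingual masked language model trained on monolingual text only (no translation language modeling, since no parallel data is used). A minimal sketch of the standard BERT/XLM-style token corruption that objective relies on, using the usual 15% / 80-10-10 scheme; subword handling, batching, and the actual vocabulary are omitted or made up for illustration:

```python
import random

# Sketch of the standard masked-language-model corruption used in BERT/XLM-style
# pretraining: select ~15% of positions; replace 80% of those with the mask id,
# 10% with a random vocabulary id, and leave 10% unchanged. The model is then
# trained to predict the original ids at the selected positions.
def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=random):
    inputs = list(token_ids)
    labels = [-100] * len(inputs)              # -100: position ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                           # position not selected
        labels[i] = tok                        # predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Illustrative call; the ids and vocabulary size are made up for the sketch.
corrupted, targets = mask_tokens(list(range(10)), vocab_size=250_000, mask_id=250_001)
print(corrupted, targets)
```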

...

References

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). arXiv:1911.02116.