Transformer-based Character-Level Language Model (LM)


A Transformer-based Character-Level Language Model (LM) is a character-level neural LM that is based on a transformer (self-attention) architecture.



References

2018

  • (Al-Rfou et al., 2018) ⇒ Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. (2018). “Character-Level Language Modeling with Deeper Self-Attention.” In: CoRR, abs/1808.04444.
    • QUOTE: ... In this paper, we show that a non-recurrent model can achieve strong results on character-level language modeling.

      Specifically, we use a deep network of transformer self-attention layers (Vaswani et al. 2017) with causal (backward-looking) attention to process fixed-length inputs and predict upcoming characters. The model is trained on mini-batches of sequences from random positions in the training corpus, with no information passed from one batch to the next.

      Our primary finding is that the transformer architecture is well-suited to language modeling over long sequences and could replace RNNs in this domain. We speculate that the transformer’s success here is due to its ability to “quickly” propagate information over arbitrary distances; by comparison, RNNs need to learn to pass relevant information forward step by step.

      We also find that some modifications to the basic transformer architecture are beneficial in this domain. Most importantly, we add three auxiliary losses, requiring the model to predict upcoming characters (i) at intermediate sequence positions, (ii) from intermediate hidden representations, and (iii) at target positions multiple steps in the future. These losses speed up convergence, and make it possible to train deeper networks.
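      The setup quoted above can be sketched in code: a stack of self-attention layers with a causal (backward-looking) mask, trained on fixed-length character segments drawn from random corpus positions, with no state carried between batches. The following is a minimal illustrative PyTorch outline, not the authors' implementation; the class name CharTransformerLM, the random_batch helper, and all layer sizes are assumptions (the paper's models are far deeper).

```python
# Illustrative sketch only; hyperparameters and helper names are assumptions.
import torch
import torch.nn as nn


class CharTransformerLM(nn.Module):
    """Character-level LM: causal self-attention over fixed-length segments."""

    def __init__(self, vocab_size=256, d_model=256, n_heads=8, n_layers=4, seq_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) of character ids.
        L = x.size(1)
        h = self.embed(x) + self.pos[:L]
        # Causal (backward-looking) mask: position i attends only to positions <= i.
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.layers(h, mask=mask)
        return self.out(h)  # (batch, seq_len, vocab) logits for the upcoming character


def random_batch(corpus_ids, batch_size, seq_len):
    """Fixed-length segments from random corpus positions; no state crosses batches."""
    starts = torch.randint(0, len(corpus_ids) - seq_len - 1, (batch_size,))
    x = torch.stack([corpus_ids[s : s + seq_len] for s in starts.tolist()])
    y = torch.stack([corpus_ids[s + 1 : s + seq_len + 1] for s in starts.tolist()])
    return x, y  # y is x shifted by one character (the prediction targets)
```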
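      The three auxiliary losses can likewise be outlined schematically. The sketch below assumes the per-layer hidden states are collected during the forward pass and reuses a single output projection for all prediction heads; the paper's separate classifiers, loss weights, and decay schedules are not reproduced here, so treat this only as an illustration of where each loss attaches.

```python
# Schematic only; shapes and the shared projection are assumptions.
import torch.nn.functional as F


def char_lm_loss(hidden_per_layer, next_chars, out_proj, chars_two_ahead=None):
    """
    hidden_per_layer: list of (batch, seq_len, d_model) tensors, one per layer.
    next_chars:       (batch, seq_len) ids of the next character at each position.
    chars_two_ahead:  (batch, seq_len) ids two steps ahead, for loss (iii).
    out_proj:         an nn.Linear(d_model, vocab_size), shared here for brevity.
    """
    final_h = hidden_per_layer[-1]

    # (i) Multiple positions: predict the upcoming character at *every*
    #     position of the segment, not only at its final position.
    main = F.cross_entropy(out_proj(final_h).transpose(1, 2), next_chars)

    # (ii) Intermediate layers: attach the same next-character prediction
    #      task to intermediate hidden representations.
    inter = sum(
        F.cross_entropy(out_proj(h).transpose(1, 2), next_chars)
        for h in hidden_per_layer[:-1]
    ) / max(len(hidden_per_layer) - 1, 1)

    # (iii) Multiple targets: also predict a character several steps in the
    #       future (two steps in this sketch) from the final representation.
    future = 0.0
    if chars_two_ahead is not None:
        future = F.cross_entropy(out_proj(final_h).transpose(1, 2), chars_two_ahead)

    return main + inter + future
```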