Text Item Encoder


A Text Item Encoder is a neural encoder that maps a text item (such as a word, phrase, sentence, or document) to a dense vector representation.
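
Illustrative example (a minimal sketch, not from the GM-RKB page itself): the code below maps a text item to a fixed-size vector with a pre-trained transformer encoder; the choice of "bert-base-uncased", the Hugging Face Transformers library, and the mean-pooling step are assumptions made for the illustration.

  import torch
  from transformers import AutoModel, AutoTokenizer

  # Load a pre-trained text encoder and its tokenizer (assumed model choice).
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  encoder = AutoModel.from_pretrained("bert-base-uncased")

  def encode_text_item(text: str) -> torch.Tensor:
      """Map a text item to a dense vector by mean-pooling its token states."""
      inputs = tokenizer(text, return_tensors="pt", truncation=True)
      with torch.no_grad():
          hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
      mask = inputs["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
      return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)

  vector = encode_text_item("an example text item")
  print(vector.shape)  # torch.Size([1, 768])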



References

2021

2020

  • (Diao et al., 2020) ⇒ Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. (2020). “ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations.” In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings.
    • ABSTRACT: The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. It omits information carried by larger text granularity, and thus the encoders cannot easily adapt to certain combinations of characters. This leads to a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations, where different combinations of characters are considered during training. As a result, potential word or phrase boundaries are explicitly pre-trained and fine-tuned with the character encoder (BERT). Therefore ZEN incorporates the comprehensive information of both the character sequence and words or phrases it contains. Experimental results illustrated the effectiveness of ZEN on a series of Chinese NLP tasks. We show that ZEN, using less resource than other published encoders, can achieve state-of-the-art performance on most tasks. Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data. The code and pre-trained models of ZEN are available at this https URL.
    • QUOTE: Pre-trained text encoders (Peters et al., 2018b; Devlin et al., 2018; Radford et al., 2018, 2019; Yang et al., 2019) have drawn much attention in natural language processing (NLP), because state-of-the-art performance can be obtained for many NLP tasks using such encoders. In general, these encoders are implemented by training a deep neural model on large unlabeled corpora. Although the use of big data brings success to these pre-trained encoders, it is still unclear whether existing encoders have effectively leveraged all useful information in the corpus. Normally, the pre-training procedures are designed to learn on tokens corresponding to small units of texts (e.g., word pieces for English, characters for Chinese) for efficiency and simplicity. However, some important information carried by larger text units may be lost for certain languages when we use a standard encoder, such as BERT. For example, in Chinese, text semantics are greatly affected by recognizing valid n-grams. This means a pre-trained encoder can potentially be improved by incorporating such boundary information of important n-grams. ...
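    • ILLUSTRATIVE SKETCH: the PyTorch module below is a minimal, hypothetical rendering of the idea described in the quote above, not the authors' released ZEN code: character-level hidden states from a BERT-style encoder are enriched with embeddings of the lexicon n-grams that cover each character position. The class name, dimensions, and the simple additive fusion are assumptions made for illustration.

      import torch
      import torch.nn as nn

      class NgramEnhancedEncoder(nn.Module):
          """Toy ZEN-style encoder: character states are enhanced with the
          embeddings of matched lexicon n-grams (illustrative sizes only)."""

          def __init__(self, char_vocab_size: int, ngram_vocab_size: int, dim: int = 128):
              super().__init__()
              self.char_embed = nn.Embedding(char_vocab_size, dim)
              self.ngram_embed = nn.Embedding(ngram_vocab_size, dim)
              self.char_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

          def forward(self, char_ids, ngram_ids, match_matrix):
              # char_ids:     (batch, num_chars)              character ids
              # ngram_ids:    (batch, num_ngrams)             matched lexicon n-gram ids
              # match_matrix: (batch, num_chars, num_ngrams)  1.0 where an n-gram covers a character
              chars = self.char_layer(self.char_embed(char_ids))  # character states
              ngrams = self.ngram_embed(ngram_ids)                # n-gram embeddings
              # Add, to each character state, the embeddings of the n-grams covering it.
              return chars + torch.bmm(match_matrix.float(), ngrams)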

2019