Text Item Encoder


A Text Item Encoder is a neural encoder that maps a text item (such as a word, phrase, sentence, or document) to a dense vector representation.
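
Illustrative example (a minimal sketch, not from the GM-RKB page itself): the code below maps a text item to a fixed-size vector with a pre-trained transformer encoder; the choice of "bert-base-uncased", the Hugging Face Transformers library, and the mean-pooling step are assumptions made for the illustration.

  import torch
  from transformers import AutoModel, AutoTokenizer

  # Load a pre-trained text encoder and its tokenizer (assumed model choice).
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  encoder = AutoModel.from_pretrained("bert-base-uncased")

  def encode_text_item(text: str) -> torch.Tensor:
      """Map a text item to a dense vector by mean-pooling its token states."""
      inputs = tokenizer(text, return_tensors="pt", truncation=True)
      with torch.no_grad():
          hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
      mask = inputs["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
      return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)

  vector = encode_text_item("an example text item")
  print(vector.shape)  # torch.Size([1, 768])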



References

2021

2020

  • (Diao et al., 2020) ⇒ Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. (2020). “ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations.” In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings.
    • ABSTRACT: The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. It omits information carried by larger text granularity, and thus the encoders cannot easily adapt to certain combinations of characters. This leads to a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations, where different combinations of characters are considered during training. As a result, potential word or phrase boundaries are explicitly pre-trained and fine-tuned with the character encoder (BERT). Therefore ZEN incorporates the comprehensive information of both the character sequence and words or phrases it contains. Experimental results illustrated the effectiveness of ZEN on a series of Chinese NLP tasks. We show that ZEN, using less resource than other published encoders, can achieve state-of-the-art performance on most tasks. Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data. The code and pre-trained models of ZEN are available at this https URL.
    • QUOTE: Pre-trained text encoders (Peters et al., 2018b; Devlin et al., 2018; Radford et al., 2018, 2019; Yang et al., 2019) have drawn much attention in natural language processing (NLP), because state-of-the-art performance can be obtained for many NLP tasks using such encoders. In general, these encoders are implemented by training a deep neural model on large unlabeled corpora. Although the use of big data brings success to these pre-trained encoders, it is still unclear whether existing encoders have effectively leveraged all useful information in the corpus. Normally, the pre-training procedures are designed to learn on tokens corresponding to small units of texts (e.g., word pieces for English, characters for Chinese) for efficiency and simplicity. However, some important information carried by larger text units may be lost for certain languages when we use a standard encoder, such as BERT. For example, in Chinese, text semantics are greatly affected by recognizing valid n-grams. This means a pre-trained encoder can potentially be improved by incorporating such boundary information of important n-grams. ...
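    • ILLUSTRATIVE SKETCH: the PyTorch module below is a minimal, hypothetical rendering of the idea described in the quote above, not the authors' released ZEN code: character-level hidden states from a BERT-style encoder are enriched with embeddings of the lexicon n-grams that cover each character position. The class name, dimensions, and the simple additive fusion are assumptions made for illustration.

      import torch
      import torch.nn as nn

      class NgramEnhancedEncoder(nn.Module):
          """Toy ZEN-style encoder: character states are enhanced with the
          embeddings of matched lexicon n-grams (illustrative sizes only)."""

          def __init__(self, char_vocab_size: int, ngram_vocab_size: int, dim: int = 128):
              super().__init__()
              self.char_embed = nn.Embedding(char_vocab_size, dim)
              self.ngram_embed = nn.Embedding(ngram_vocab_size, dim)
              self.char_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

          def forward(self, char_ids, ngram_ids, match_matrix):
              # char_ids:     (batch, num_chars)              character ids
              # ngram_ids:    (batch, num_ngrams)             matched lexicon n-gram ids
              # match_matrix: (batch, num_chars, num_ngrams)  1.0 where an n-gram covers a character
              chars = self.char_layer(self.char_embed(char_ids))  # character states
              ngrams = self.ngram_embed(ngram_ids)                # n-gram embeddings
              # Add, to each character state, the embeddings of the n-grams covering it.
              return chars + torch.bmm(match_matrix.float(), ngrams)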

2019