Distributional-based Word/Token Embedding Space

From GM-RKB

A Distributional-based Word/Token Embedding Space is a text-item embedding space whose word vectors are produced by a distributional word vectorizing function (a function that maps words or tokens to distributional word vectors).
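
A minimal sketch of what such a space can look like, assuming a simple count-based vectorizing function: each token is mapped to a vector of co-occurrence counts within a small context window. The function names (`build_embedding_space`, `cosine`), the window size, and the toy corpus are illustrative assumptions, not part of the definition above; learned models such as word2vec or GloVe produce dense vectors by other means.

```python
from collections import Counter
from math import sqrt

def build_embedding_space(sentences, window=2):
    """Map each token to a vector of co-occurrence counts within +/- `window` tokens."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    counts = {tok: Counter() for tok in vocab}
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][sent[j]] += 1
    # One fixed-length vector per token; dimension k counts co-occurrences with vocab[k].
    space = {tok: [counts[tok][ctx] for ctx in vocab] for tok in vocab}
    return vocab, space

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus (an assumption for illustration only).
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
vocab, space = build_embedding_space(corpus)

# Tokens that appear in similar contexts end up closer in the space.
print(cosine(space["cat"], space["dog"]))
print(cosine(space["cat"], space["milk"]))
```

In this toy space, "cat" and "dog" share most of their contexts and so score a higher cosine similarity than "cat" and "milk".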



References

2018

  • (Wolf, 2018b) ⇒ Thomas Wolf. (2018). “The Current Best of Universal Word Embeddings and Sentence Embeddings.” Blog post.
    • QUOTE: Word and sentence embeddings have become an essential part of any Deep-Learning-based natural language processing systems. They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual data. A huge trend is the quest for Universal Embeddings: embeddings that are pre-trained on a large corpus and can be plugged in a variety of downstream task models (sentimental analysis, classification, translation…) to automatically improve their performance by incorporating some general word/sentence representations learned on the larger dataset. It’s a form of transfer learning. Transfer learning has been recently shown to drastically increase the performance of NLP models on important tasks such as text classification. …

      … A wealth of possible ways to embed words have been proposed over the last five years. The most commonly used models are word2vec and GloVe which are both unsupervised approaches based on the distributional hypothesis (words that occur in the same contexts tend to have similar meanings). While several works augment these unsupervised approaches by incorporating the supervision of semantic or syntactic knowledge, purely unsupervised approaches have seen interesting developments in 2017–2018, the most notable being FastText (an extension of word2vec) and ELMo (state-of-the-art contextual word vectors).
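
The idea in the quote above, that pre-trained word vectors can be reused to encode a sentence as a fixed-length dense vector for a downstream model, can be illustrated with a rough sketch. It assumes mean pooling, a common simple baseline rather than a method attributed to the quoted post. The lookup table `PRETRAINED`, its 3-dimensional values, and the function `sentence_vector` are hypothetical; a real table would be loaded from a model such as word2vec or GloVe.

```python
# Hypothetical 3-dimensional "pre-trained" vectors (made-up values for illustration);
# real embedding tables hold far more dimensions, learned from a large corpus.
PRETRAINED = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "movie": [0.1, 0.9, 0.3],
    "bad": [-0.7, 0.1, 0.2],
}
DIM = 3

def sentence_vector(tokens, table=PRETRAINED, dim=DIM):
    """Average the vectors of known tokens; tokens missing from the table are skipped."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(column) / len(known) for column in zip(*known)]

# The result has the same length regardless of sentence length,
# so it can be fed to a downstream classifier as a feature vector.
features = sentence_vector("a great movie".split())
print(features)
```

Skipping unknown tokens is one simple way to handle out-of-vocabulary words; subword-based models address the same problem differently.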

2017a

   Subword-level embeddings
   OOV handling
   Evaluation
   Multi-sense embeddings
   Beyond words as points
   Phrases and multi-word expressions
   Bias
   Temporal dimension
   Lack of theoretical understanding
   Task and domain-specific embeddings
   Transfer learning
   Embeddings for multiple languages
   Embeddings based on other contexts

2017

  • (Yang, Lu & Zheng, 2017) ⇒ Wei Yang, Wei Lu, and Vincent Zheng. (2017). “A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings.” In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2898-2904.
    • ABSTRACT: Learning word embeddings has received a significant amount of attention recently. Often, word embeddings are learned in an unsupervised manner from a large collection of text. The genre of the text typically plays an important role in the effectiveness of the resulting embeddings. How to effectively train word embedding models using data from different domains remains a problem that is underexplored. In this paper, we present a simple yet effective method for learning word embeddings based on text from different domains. We demonstrate the effectiveness of our approach through extensive experiments on various down-stream NLP tasks.