Distributional-based Subword Embedding Space

A Distributional-based Subword Embedding Space is a text-item embedding space for subwords that is associated with a distributional subword embedding function (which maps subwords to distributional subword vectors).
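A minimal sketch can make the definition concrete. In the following Python sketch (the `embed` name, the toy subwords, and the 100-dimensional random vectors are illustrative assumptions, not part of the source), the embedding space is a mapping from subword strings to distributional vectors, and the distributional subword embedding function is a lookup into that mapping:

```python
from typing import Dict
import numpy as np

# A distributional subword embedding space, sketched as a mapping from
# subword strings (e.g., character n-grams) to distributional vectors.
# Random placeholders stand in for vectors that would be learned from
# co-occurrence statistics in a corpus.
SubwordEmbeddingSpace = Dict[str, np.ndarray]

space: SubwordEmbeddingSpace = {
    "delta": np.random.rand(100),
    "proteo": np.random.rand(100),
    "bacteria": np.random.rand(100),
}

def embed(subword: str, space: SubwordEmbeddingSpace) -> np.ndarray:
    """Distributional subword embedding function: subword -> vector."""
    return space[subword]

print(embed("delta", space).shape)  # (100,)
```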



References

2019

  • (Zhang et al., 2019) ⇒ Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. (2019). “BioWordVec, Improving Biomedical Word Embeddings with Subword Information and MeSH.” In: Scientific Data, 6(1).
    • QUOTE: … Subsequently, we use the subword embedding model to learn the text sequences and MeSH term sequences in a unified n-gram embedding space. Our word embeddings are assessed for both validity and utility on multiple BioNLP tasks …

       Bojanowski et al. (2017) proposed fastText: a subword embedding model based on the skip-gram model that learns distributed embeddings of character n-grams from unlabeled corpora, where each word is represented as the sum of the vector representations of its n-grams. Compared to the word2vec model, the subword embedding model can make effective use of the subword information and internal word structure to improve the embedding quality. In the biomedical domain, many specialized compound words, such as “deltaproteobacteria”, are rare or out-of-vocabulary (OOV) in the training corpora, thus making them difficult to learn properly using the word2vec model. In contrast, the subword embedding model is naturally better suited to such situations. For instance, since “delta”, “proteo” and “bacteria” are common in the training corpora, the subword embedding model can learn the distributed representations of all character n-grams of “deltaproteobacteria”, and subsequently integrate the subword vectors to create the final embedding of “deltaproteobacteria”. In this study, we apply the subword embedding model to learn word embeddings from the joint text sequences of PubMed and MeSH. …

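The composition step described in the quote, decomposing a word into character n-grams and summing their vectors, can be sketched as follows. This is a minimal illustration assuming fastText's '<' and '>' word-boundary markers and its default n-gram lengths of 3 to 6; the function names and the 100-dimensional toy space are hypothetical, and real fastText additionally hashes n-grams into a fixed number of buckets:

```python
import numpy as np

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list:
    """Enumerate the character n-grams of a word, using fastText's
    '<' and '>' word-boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

def word_vector(word: str, ngram_vectors: dict, dim: int = 100) -> np.ndarray:
    """Compose a word embedding as the sum of its n-gram vectors, as in
    Bojanowski et al. (2017). N-grams absent from the trained space are
    skipped, so a rare or OOV compound still composes to a usable vector
    from whichever of its pieces were seen in training."""
    vec = np.zeros(dim)
    for ng in char_ngrams(word):
        if ng in ngram_vectors:
            vec += ngram_vectors[ng]
    return vec

# Toy usage: vectors for a few common n-grams suffice to give the rare
# compound "deltaproteobacteria" a non-zero composed embedding.
rng = np.random.default_rng(0)
toy_space = {ng: rng.standard_normal(100) for ng in ["delta", "proteo", "bacter"]}
print(word_vector("deltaproteobacteria", toy_space)[:3])
```

In a library implementation such as gensim's FastText (where min_n and max_n control the n-gram lengths), these n-gram vectors are learned jointly under the skip-gram objective rather than stored in a plain dictionary as above.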
2017

  • (Bojanowski et al., 2017) ⇒ Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. (2017). “Enriching Word Vectors with Subword Information.” In: Transactions of the Association for Computational Linguistics, 5.