SentEval Library


A SentEval Library is a sentence embedding evaluation system (one that evaluates the quality of sentence embeddings across a variety of downstream transfer tasks).



References

2024

  • https://github.com/facebookresearch/SentEval
    • QUOTE: SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.
    • NOTES
      • SentEval Library is designed to evaluate the quality of sentence embeddings by using them as features in a wide array of "transfer" tasks, comprising 17 downstream tasks and a suite of 10 probing tasks that analyze the linguistic properties encoded in the embeddings.
      • It has been updated with additional probing tasks to better assess the linguistic properties encoded in sentence embeddings, reflecting an ongoing effort to make the evaluation more comprehensive.
      • It ships example scripts for three sentence encoders: SkipThought-LN, GenSen, and Google-USE, showcasing its adaptability to various sentence encoding methods.
      • It requires Python 2/3 with NumPy/SciPy, PyTorch (version 0.4 or later), and scikit-learn (version 0.18.0 or later) as dependencies, indicating its reliance on a Python-based machine learning ecosystem.
      • It includes a diverse set of downstream tasks, such as movie review sentiment analysis, product reviews, subjectivity/objectivity classification, and natural language inference, demonstrating its wide applicability for evaluating sentence embeddings across different contexts.
      • It comes with a series of probing tasks designed to evaluate specific linguistic properties encoded in sentence embeddings, such as sentence length prediction, word content analysis, and verb tense prediction, highlighting its focus on linguistic detail.
      • It requires the user to implement two functions (prepare and batcher) to adapt SentEval to their specific sentence embeddings, emphasizing its flexible and customizable framework for embedding evaluation (see the usage sketch after this list).
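
A minimal usage sketch, following the pattern described in the SentEval README: the user implements prepare and batcher and hands them to the evaluation engine, which then runs the requested downstream and probing tasks. The toy hash-based encoder, the PATH_TO_DATA value, and the classifier settings below are illustrative assumptions rather than prescribed values.

  import numpy as np
  import senteval

  PATH_TO_DATA = 'data'  # hypothetical path to the downloaded SentEval task datasets

  def prepare(params, samples):
      # Called once per task with all raw sentences; build vocabularies or
      # other resources here. This toy version needs none.
      return

  def batcher(params, batch):
      # Called on each mini-batch of tokenized sentences; must return a 2D
      # numpy array with one fixed-size embedding per sentence.
      # Toy encoder: hash-based bag-of-features, for illustration only.
      embeddings = []
      for sent in batch:
          sent = sent if sent != [] else ['.']
          vec = np.zeros(128)
          for word in sent:
              vec[hash(word) % 128] += 1.0
          embeddings.append(vec / len(sent))
      return np.vstack(embeddings)

  # Evaluation settings (values chosen for illustration).
  params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
  params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                          'tenacity': 5, 'epoch_size': 4}

  se = senteval.engine.SE(params, batcher, prepare)

  # A mix of downstream tasks (MR, CR, SST2, SICKEntailment) and
  # probing tasks (Length, WordContent, Tense).
  transfer_tasks = ['MR', 'CR', 'SST2', 'SICKEntailment',
                    'Length', 'WordContent', 'Tense']
  results = se.eval(transfer_tasks)
  print(results)

In a real setup, batcher would wrap an actual sentence encoder (for example, one of the bundled SkipThought-LN, GenSen, or Google-USE example scripts), and the returned results are reported per task.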

2018a

  • (Conneau & Kiela, 2018) ⇒ Alexis Conneau and Douwe Kiela. "SentEval: An evaluation toolkit for universal sentence representations". In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018.