SuperGLUE Benchmarking Task
A SuperGLUE Benchmarking Task is an LLM inference evaluation task that can be used to test a model's general language understanding across challenging NLU tasks with deeper reasoning requirements.
- AKA: Super General Language Understanding Evaluation, SuperGLUE Benchmark.
- Context:
- Task Input: Structured language understanding prompts (e.g., QA, coreference)
- Task Optional Input: Task definitions, instruction formats
- Task Output: Label or generated response (depending on task)
- Task Performance Measure/Metrics: Average score across the 8 tasks, using per-task metrics such as accuracy, F1, and exact match (see the scoring sketch after this list)
- Benchmark Datasets: https://super.gluebenchmark.com/tasks
- It can evaluate models on more linguistically nuanced and challenging NLU tasks than the GLUE Benchmark.
- It can include tasks such as causal reasoning, coreference resolution, and question answering.
- ...
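The overall SuperGLUE score referenced above is the unweighted average of the per-task metrics. Below is a minimal sketch, assuming the Hugging Face `datasets` library (which hosts the tasks under the "super_glue" name); the per-task scores are illustrative placeholders, not real model results.

```python
# Minimal sketch: loading one SuperGLUE task and averaging per-task scores.
# Assumes the Hugging Face `datasets` library; scores below are placeholders.
from datasets import load_dataset

# Each SuperGLUE task is a configuration of the "super_glue" dataset.
boolq = load_dataset("super_glue", "boolq", split="validation")
print(boolq[0])  # fields include 'passage', 'question', and a binary 'label'

# Hypothetical per-task scores (accuracy, F1, or EM depending on the task).
task_scores = {
    "BoolQ": 0.80, "CB": 0.90, "COPA": 0.85, "MultiRC": 0.75,
    "ReCoRD": 0.82, "RTE": 0.78, "WiC": 0.70, "WSC": 0.88,
}

# The overall SuperGLUE score is the unweighted mean of the per-task scores;
# tasks that report two metrics (e.g., F1 and accuracy) average them first.
overall = sum(task_scores.values()) / len(task_scores)
print(f"Overall score: {overall:.3f}")
```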
- Example(s):
- the original SuperGLUE benchmark evaluation reported by Wang et al. (2019),
- T5 achieving state-of-the-art results on SuperGLUE by reformulating all tasks as text-to-text generation.
- DeBERTa tested on SuperGLUE for multi-task generalization and contextual understanding.
- GPT-3 evaluated in few-shot settings using SuperGLUE task prompts (see the prompt-formatting sketch after this list).
- ...
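Few-shot evaluation of the kind applied to GPT-3 works by concatenating a handful of labeled task instances with one unlabeled query instance into a single prompt. The snippet below is a minimal sketch for BoolQ; the template wording and example passages are illustrative assumptions, not the exact prompts used in the GPT-3 evaluation.

```python
# Minimal sketch: formatting SuperGLUE BoolQ items as a few-shot prompt.
# Template wording and example passages are illustrative assumptions.

def boolq_block(passage, question, answer=""):
    """Render one BoolQ item; leave `answer` empty for the query item."""
    return f"Passage: {passage}\nQuestion: {question}\nAnswer: {answer}".rstrip()

few_shot_examples = [
    ("The Amazon is the largest rainforest on Earth.",
     "Is the Amazon the largest rainforest?", "yes"),
    ("Pluto was reclassified as a dwarf planet in 2006.",
     "Is Pluto still classified as a planet?", "no"),
]

def build_prompt(query_passage, query_question):
    """Concatenate labeled demonstrations with an unanswered query item."""
    shots = "\n\n".join(boolq_block(p, q, a) for p, q, a in few_shot_examples)
    return shots + "\n\n" + boolq_block(query_passage, query_question)

print(build_prompt("Water boils at 100 degrees Celsius at sea level.",
                   "Does water boil at 100 degrees Celsius at sea level?"))
```

The model's continuation after the final "Answer:" is then mapped to the task's label set (yes/no) and scored with the task's metric.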
- Counter-Example(s):
- GLUE Benchmarking Task, which is a predecessor with simpler tasks and shallower linguistic requirements.
- MMLU (Massive Multitask Language Understanding) Benchmark, which focuses on subject-specific reasoning, not general NLU.
- Closed QA Benchmarks like SQuAD, which test span extraction rather than general understanding.
- RTE Challenge (Bentivogli et al., 2009), which is a single textual entailment task (and itself a SuperGLUE constituent) rather than a multi-task benchmark.
- Semantic Textual Similarity Benchmark (STS-B).
- BIG-Bench Hard (BBH) Benchmark.
- ...
- See: LLM Inference System, Multi-Task Learning, Natural Language Understanding, Natural Language Inference System, Lexical Entailment, Syntactic Parser, Morphological Analyzer, Word Sense Disambiguation, Lexical Semantic Relatedness, Logical Inference.
References
2022
- (Liang, Bommasani et al., 2022) ⇒ Percy Liang, Rishi Bommasani, et al. (2022). “Holistic Evaluation of Language Models.” doi:10.48550/arXiv.2211.09110
- QUOTE: ... As more general-purpose approaches to NLP grew, often displacing more bespoke task-specific approaches, new benchmarks such as SentEval (Conneau and Kiela, 2018), DecaNLP (McCann et al., 2018), GLUE (Wang et al., 2019b), and SuperGLUE (Wang et al., 2019a) co-evolved to evaluate their capabilities. In contrast to the previous class of benchmarks, these benchmarks assign each model a vector of scores to measure the accuracy for a suite of scenarios. In some cases, these benchmarks also provide an aggregate score (e.g. the GLUE score, which is the average of the accuracies for each of the constituent scenarios). ...
2019a
- (Wang, Pruksachatkun et al., 2019) ⇒ Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. (2019). “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.” In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). arXiv:1905.00537
- QUOTE: ... In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. ...
2019b
- (Superglue Benchmark, 2019) ⇒ https://super.gluebenchmark.com/ Retrieved:2019-09-15
- QUOTE: In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced one year ago, offered a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently come close to the level of non-expert humans, suggesting limited headroom for further research.
We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.