MT-Bench
MT-Bench is an LLM inference evaluation task that can be used to assess the multi-turn conversational and instruction-following capabilities of large language models (LLMs) through a curated set of open-ended prompts and automated grading by a strong LLM judge.
- AKA: Multi-Turn Benchmark, LMSYS MT-Bench.
- Context:
- Task Input: Multi-turn user prompt (e.g., instruction or question, possibly as part of a dialogue).
- Optional Input: System instruction, prior dialogue history, or conversational context.
- Task Output: Natural language response generated by the model for each prompt.
- Task Performance Measure/Metrics: GPT-4-based scoring (1–10), pairwise model preference, agreement with human ratings, and category-specific scores.
- Benchmark Dataset: https://github.com/lm-sys/FastChat
- The MT-Bench dataset consists of 80 hand-crafted multi-turn prompts (two user turns each) across 8 instruction categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, and Humanities (see the dataset-loading sketch before the References section).
- It can take a multi-turn user prompt, optionally with system instructions or prior dialogue context, and generate a model response.
- It can evaluate the quality of the response using performance measures such as GPT-4-based single-answer grading on a 10-point scale, pairwise comparisons, and category-specific breakdowns (see the judge-grading sketch before the References section).
- It can assess models across 8 categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, and Humanities.
- It can utilize GPT-4 as an automated judge, achieving over 80% agreement with human preferences.
- It can serve as a scalable and explainable method to approximate human evaluations of LLM outputs.
- Example(s):
- GPT-4 evaluated using MT-Bench, achieving the highest average score among the models compared in Zheng et al. (2023).
- Vicuna-13B tested on MT-Bench to assess its conversational abilities in multi-turn settings.
- Claude-v1 benchmarked with MT-Bench to compare its performance against other models.
- ...
- Counter-Example(s):
- MMLU Benchmarking Task, which focuses on multiple-choice questions rather than open-ended, multi-turn dialogues.
- SQuAD Benchmarking Task, which evaluates extractive question answering without multi-turn interactions.
- GLUE Benchmarking Task, which assesses sentence-level classification tasks, not conversational abilities.
- ...
- See: LLM Inference Evaluation Task, Chatbot Arena, GPT-4, Vicuna, FastChat.
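The dataset layout described above can be illustrated with a short, hedged sketch. The snippet below assumes the question.jsonl file shipped in the FastChat repository under fastchat/llm_judge/data/mt_bench/ (records with question_id, category, and a two-element turns list); it loads the 80 prompts and tallies them by category. The path and field names follow FastChat's current layout and may differ between versions.

```python
import json
from collections import Counter

# Path as laid out in the FastChat repository; adjust to your checkout.
QUESTION_FILE = "fastchat/llm_judge/data/mt_bench/question.jsonl"

def load_questions(path):
    """Read one JSON object per line (question_id, category, turns)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

questions = load_questions(QUESTION_FILE)
print(len(questions))                             # expected: 80 questions
print(Counter(q["category"] for q in questions))  # 8 categories, 10 prompts each
print(questions[0]["turns"])                      # two user turns per question
```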
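Single-answer grading can be sketched in a similarly hedged way. The snippet below is a simplified, hypothetical variant of the judge step, not FastChat's actual judge template: it sends one question/answer pair to a strong judge model through the OpenAI Python client and parses a 1–10 rating from a [[N]] verdict tag. The prompt wording, default model name, and the judge_single_answer helper are illustrative assumptions.

```python
import re
from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Simplified judge prompt; the real MT-Bench templates ship with FastChat.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. Rate the "
    "response on a scale of 1 to 10 and end with a verdict of the form: Rating: [[N]].\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def judge_single_answer(question: str, answer: str, model: str = "gpt-4") -> float | None:
    """Ask an LLM judge for a 1-10 score and parse the [[N]] verdict."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None
```

In the full benchmark, each candidate model answers both turns of all 80 questions, the judge scores every turn, and per-category averages are reported alongside the overall mean.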
References
2023a
- (LMSYS, 2023) ⇒ LMSYS. (2023). "The Chatbot Arena Leaderboard". In: LMSYS Blog.
- QUOTE: The Chatbot Arena Leaderboard ranks large language models (LLMs) based on user votes in a crowdsourced, competitive setting.
This leaderboard is generated from over 1.5 million human votes collected via the Chatbot Arena.
2023b
- (Zheng et al., 2023) ⇒ Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". In: arXiv preprint, arXiv:2306.05685.
- QUOTE: We introduce MT-bench, a set of challenging multi-turn questions, and evaluate models by prompting strong LLMs like GPT-4 to act as judges.
We find that LLM-as-a-judge correlates well with human preferences, and can be a cost-effective and scalable alternative to human evaluation.
2023c
- (LMSYS, 2023) ⇒ LMSYS. (2023). "MT-Bench". In: Hugging Face Spaces.
- QUOTE: MT-Bench is a challenging multi-turn question set for evaluating chatbots.
It consists of questions spanning different categories to assess various aspects of model capabilities.
2023d
- (LMSYS, 2023) ⇒ LMSYS. (2023). "FastChat: An Open Platform for Training, Serving, and Evaluating LLM-based Chatbots". In: GitHub Repository.
- QUOTE: FastChat is an open platform for training, serving, and evaluating large language model based chatbots.
FastChat powers Chatbot Arena, serving over 10 million chat requests for 70+ LLMs.
Chatbot Arena has collected over 1.5M human votes from side-by-side LLM battles to compile an online LLM Elo leaderboard.