MT-Bench
An MT-Bench is a multi-turn LLM inference evaluation task that assesses the multi-turn conversational and instruction-following capabilities of large language models through a curated multi-turn prompt set and automated grading by strong LLM judges.
- AKA: Multi-Turn Benchmark, LMSYS MT-Bench, MT-Bench Evaluation Task, Multi-Turn LLM Evaluation Benchmark.
- Context:
- It can typically evaluate multi-turn LLM response quality through GPT-4-based scoring on a 10-point scale (see the single-answer grading sketch after this list).
- It can typically assess multi-turn instruction-following capabilities across eight evaluation categories.
- It can typically measure multi-turn conversational coherence through dialogue context preservation.
- It can often achieve over 80% agreement between its LLM-judge verdicts and human preference judgments.
- It can often support multi-turn model comparison through pairwise preference evaluation.
- It can often enable category-specific multi-turn performance analysis through domain-segregated scoring.
- It can utilize 80 hand-crafted multi-turn prompts spanning diverse instruction domains.
- It can evaluate across Writing Tasks, Roleplay Tasks, Extraction Tasks, Reasoning Tasks, Math Tasks, Coding Tasks, STEM Tasks, and Humanities Tasks.
- It can provide explainable multi-turn evaluation through LLM-generated scoring rationales.
- It can facilitate scalable multi-turn LLM assessment without requiring extensive human annotation.
- It can serve as a cost-effective multi-turn evaluation alternative to human evaluation studies.
- It can range from being a Single-Model MT-Bench Evaluation to being a Comparative MT-Bench Evaluation, depending on its evaluation mode.
- It can range from being a Category-Specific MT-Bench Assessment to being a Comprehensive MT-Bench Assessment, depending on its evaluation scope.
- It can integrate with FastChat platform for streamlined multi-turn evaluation workflow.
- ...
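A minimal sketch of the single-answer grading mode is shown below. It is not the FastChat reference implementation: the judge prompt wording and the `judge_fn` callback (e.g., a wrapper around a GPT-4 API call) are assumptions for illustration, and only the "Rating: [[N]]"-style score extraction mirrors the commonly documented output format.
```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical judge prompt (an assumption, not the exact official template):
# it asks the judge LLM for a 1-10 score ending in the "Rating: [[N]]" format.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the assistant's "
    "responses in the multi-turn conversation below. Rate the responses on a scale "
    "of 1 to 10 and end your reply with: Rating: [[N]]\n\n{conversation}"
)

def format_conversation(turns: Sequence[str], answers: Sequence[str]) -> str:
    """Interleave the benchmark's user turns with the evaluated model's answers."""
    lines = []
    for user_turn, assistant_answer in zip(turns, answers):
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_answer}")
    return "\n".join(lines)

def grade_single_answer(
    turns: Sequence[str],
    answers: Sequence[str],
    judge_fn: Callable[[str], str],  # e.g. a wrapper around a GPT-4 API call
) -> Optional[float]:
    """Return the judge's 1-10 score for one multi-turn exchange, or None if unparsable."""
    judgment = judge_fn(JUDGE_TEMPLATE.format(
        conversation=format_conversation(turns, answers)
    ))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None
```
In the published setup, each of the 80 prompts has two user turns; per-model results are typically reported as the mean judge score overall and per category.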
- Example(s):
- MT-Bench Evaluation Categories, such as:
- MT-Bench Writing Tasks, evaluating creative writing capabilities and text composition skills.
- MT-Bench Roleplay Tasks, assessing character embodiment and persona consistency.
- MT-Bench Extraction Tasks, testing information extraction capabilities.
- MT-Bench Reasoning Tasks, measuring logical reasoning and analytical thinking.
- MT-Bench Math Tasks, evaluating mathematical problem-solving.
- MT-Bench Coding Tasks, assessing programming capabilities.
- MT-Bench STEM Tasks, testing scientific knowledge and technical understanding.
- MT-Bench Humanities Tasks, evaluating humanities knowledge and cultural understanding.
- MT-Bench Model Evaluations, such as:
- GPT-4 MT-Bench Evaluation, achieving top-tier scores across all categories.
- Claude MT-Bench Evaluation, demonstrating strong multi-turn conversational performance.
- Vicuna-13B MT-Bench Evaluation, benchmarking open-source LLM performance.
- Llama-2 MT-Bench Evaluation, assessing instruction-tuned model capabilities.
- ChatGPT MT-Bench Evaluation, measuring commercial chatbot performance.
- MT-Bench Scoring Modes, such as:
- Single-Answer Grading Mode, providing absolute quality scores from 1-10.
- Pairwise Comparison Mode, determining relative model preferences (see the aggregation sketch after this example list).
- Reference-Based Scoring Mode, comparing against gold-standard responses.
- MT-Bench Implementations, such as:
- FastChat MT-Bench Implementation, the original implementation framework.
- HuggingFace MT-Bench Integration, providing cloud-based evaluation.
- Local MT-Bench Deployment, enabling private model evaluation.
- MT-Bench Research Applications, such as:
- Zheng et al., 2023, introducing MT-Bench and validating the LLM-as-a-judge approach.
- LMSYS Chatbot Arena, using MT-Bench for large-scale model ranking.
- ...
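The pairwise comparison mode and category-segregated reporting can be illustrated with a small aggregation sketch. This is a hedged example, not part of the official tooling: judge verdicts are assumed to be already extracted into the strings "A", "B", or "tie", the sample records are invented, and counting ties as half a win is one common convention rather than a prescribed rule.
```python
from collections import defaultdict

# Illustrative verdict records: (category, verdict), where the verdict says
# whether the judge preferred model A, model B, or called a tie for that prompt.
verdicts = [
    ("writing", "A"), ("writing", "tie"),
    ("reasoning", "B"), ("math", "A"),
    ("coding", "A"), ("coding", "B"),
]

def pairwise_summary(records):
    """Aggregate judge verdicts into per-category win rates for model A."""
    per_category = defaultdict(lambda: {"A": 0, "B": 0, "tie": 0})
    for category, verdict in records:
        per_category[category][verdict] += 1

    summary = {}
    for category, counts in per_category.items():
        total = sum(counts.values())
        # Count a tie as half a win for each side (one common convention).
        summary[category] = (counts["A"] + 0.5 * counts["tie"]) / total
    return summary

print(pairwise_summary(verdicts))
# e.g. {'writing': 0.75, 'reasoning': 0.0, 'math': 1.0, 'coding': 0.5}
```
The same per-category grouping applies to single-answer grading, where the per-category values are mean 1-10 scores rather than win rates.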
- Counter-Example(s):
- MMLU Benchmark, which focuses on single-turn multiple-choice questions rather than multi-turn dialogues.
- SQuAD Benchmark, which evaluates extractive question answering without conversational interaction.
- GLUE Benchmark, which assesses sentence-level classification rather than multi-turn generation.
- HumanEval Benchmark, which tests code generation without multi-turn refinement.
- TruthfulQA Benchmark, which measures factual accuracy in single-turn responses.
- See: LLM Inference Evaluation Task, Multi-Turn Conversation Evaluation, LLM-as-Judge, Chatbot Arena, GPT-4, FastChat, Instruction-Following Evaluation, Dialogue System Evaluation, LLM Benchmark, Conversational AI Evaluation.
References
2023a
- (LMSYS, 2023) ⇒ LMSYS. (2023). "The Chatbot Arena Leaderboard". In: LMSYS Blog.
- QUOTE: The Chatbot Arena Leaderboard ranks large language models (LLMs) based on user votes in a crowdsourced, competitive setting.
This leaderboard is generated from over 1.5 million human votes collected via the Chatbot Arena.
2023b
- (Zheng et al., 2023) ⇒ Zheng, L., Chiang, W. L., Zhang, S., Zheng, Y., Zhuang, S., Wei, J., ... & Gonzalez, J. E. (2023). "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena". In: arXiv preprint, arXiv:2306.05685.
- QUOTE: We introduce MT-bench, a set of challenging multi-turn questions, and evaluate models by prompting strong LLMs like GPT-4 to act as judges.
We find that LLM-as-a-judge correlates well with human preferences, and can be a cost-effective and scalable alternative to human evaluation.
2023c
- (LMSYS, 2023) ⇒ LMSYS. (2023). "MT-Bench". In: Hugging Face Spaces.
- QUOTE: MT-Bench is a challenging multi-turn question set for evaluating chatbots.
It consists of questions spanning different categories to assess various aspects of model capabilities.
2023d
- (LMSYS, 2023c) ⇒ LMSYS. (2023). "FastChat: An Open Platform for Training, Serving, and Evaluating LLM-based Chatbots". In: _GitHub Repository_.
- QUOTE: FastChat is an open platform for training, serving, and evaluating large language model based chatbots.
FastChat powers Chatbot Arena, serving over 10 million chat requests for 70+ LLMs.
Chatbot Arena has collected over 1.5M human votes from side-by-side LLM battles to compile an online LLM Elo leaderboard.