LLM-as-Judge
An LLM-as-Judge is a model-based llm evaluation method in which a large language model scores or ranks llm-generated outputs against evaluation criteria and scoring rubrics.
- AKA: LLM Judge, Model-as-Judge, AI Judge, LLM Evaluator, Automated LLM Judge.
- Context:
- It can typically perform Pairwise Comparisons between llm output alternatives, llm model responses, and llm generation variants.
- It can typically conduct Direct Scoring Assessments using llm rating scales, llm quality rubrics, and llm evaluation prompts (both this protocol and pairwise comparison are sketched after this block).
- It can typically evaluate Response Relevance through llm semantic analysis, llm context alignment, and llm query matching.
- It can typically assess Output Correctness via llm factual verification, llm logical consistency, and llm accuracy checking.
- It can typically measure Generation Quality through llm fluency scores, llm coherence ratings, and llm style assessments.
- ...
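Both of the protocols named above are typically implemented as a short evaluation prompt plus light parsing of the judge's reply. Below is a minimal Python sketch of direct scoring and pairwise comparison; `call_llm` is a placeholder for whatever model API is actually used, and the prompt wording, score range, and reply format are illustrative assumptions rather than a fixed standard.

```python
import re


def call_llm(prompt: str) -> str:
    """Placeholder judge call: replace with a real LLM API request.
    Returns a canned reply here so the sketch runs end to end."""
    return "Reasoning: the answer addresses the question directly.\nScore: 4"


def judge_score(question: str, answer: str, rubric: str, scale=(1, 5)) -> int:
    """Direct scoring: ask the judge for one integer rating on a rubric."""
    prompt = (
        "You are an impartial evaluator.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rate the answer from {scale[0]} to {scale[1]}. "
        "Explain briefly, then end with 'Score: <number>'."
    )
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))


def judge_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: ask the judge which of two answers is better."""
    prompt = (
        "You are an impartial evaluator.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly 'A', 'B', or 'Tie'."
    )
    reply = call_llm(prompt).strip()
    return reply if reply in {"A", "B", "Tie"} else "Tie"
```

In practice, pairwise judges are usually run twice with the answer order swapped, because position bias (favoring whichever answer appears first) is a documented failure mode of LLM judges.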
- It can often detect Hallucination Patterns using llm factuality checks, llm source grounding, and llm claim verification (a grounding-check sketch follows this block).
- It can often identify Safety Violations through llm toxicity detection, llm bias identification, and llm harmful content flagging.
- It can often provide Explanation Generation with llm reasoning traces, llm score justifications, and llm improvement suggestions.
- It can often implement Multi-Criteria Evaluations across llm quality dimensions, llm assessment aspects, and llm performance factors.
- It can often support Domain-Specific Assessments for llm specialized tasks, llm expert domains, and llm professional contexts.
- ...
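Hallucination checking with an LLM judge is commonly framed as claim verification against a source passage: the judge is shown the source and a claim and labels the claim as supported or not. The sketch below follows that framing, reusing the `call_llm` placeholder from the earlier sketch; the prompt wording and the binary label set are illustrative assumptions.

```python
def judge_grounding(source: str, claim: str) -> bool:
    """Claim verification: is the claim fully supported by the source passage?
    Returns True only when the judge answers 'SUPPORTED'."""
    prompt = (
        "You are a strict fact-checking judge.\n"
        f"Source: {source}\n"
        f"Claim: {claim}\n"
        "Is the claim fully supported by the source? "
        "Reply with exactly 'SUPPORTED' or 'NOT_SUPPORTED'."
    )
    return call_llm(prompt).strip() == "SUPPORTED"
```

Running such a check over every extracted claim in a response gives a simple reference-grounded hallucination rate for that response.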
- It can range from being a Simple LLM-as-Judge to being a Complex LLM-as-Judge, depending on its llm judge evaluation complexity.
- It can range from being a Binary LLM-as-Judge to being a Multi-Class LLM-as-Judge, depending on its llm judge output granularity.
- It can range from being a Single-Turn LLM-as-Judge to being a Multi-Turn LLM-as-Judge, depending on its llm judge interaction depth.
- It can range from being a Reference-Free LLM-as-Judge to being a Reference-Based LLM-as-Judge, depending on its llm judge grounding requirement.
- It can range from being a General LLM-as-Judge to being a Specialized LLM-as-Judge, depending on its llm judge domain focus.
- ...
- It can utilize Evaluation Prompt Templates with llm judge instructions, llm scoring guidelines, and llm assessment criteria.
- It can employ Chain-of-Thought Reasoning for llm judge deliberation, llm evaluation steps, and llm reasoning processes.
- It can leverage Few-Shot Examples through llm judge demonstrations, llm evaluation samples, and llm scoring examples.
- It can implement Consistency Checks via llm judge calibration, llm score validation, and llm evaluation reliability (a sketch combining templates, chain-of-thought, few-shot examples, and consistency checks follows this block).
- ...
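The implementation notes above (evaluation prompt templates, chain-of-thought deliberation, few-shot examples, and consistency checks) are often combined in a single judging routine: the template embeds a worked example, the judge is asked to reason before scoring, and the score is sampled several times and aggregated to reduce run-to-run variance. The sketch below is one such combination, again reusing the `call_llm` placeholder; the template text, the single few-shot example, and the median-of-three aggregation are illustrative choices rather than a prescribed recipe.

```python
import re
import statistics

FEW_SHOT_TEMPLATE = """You are an impartial evaluator. Think step by step, then give a score from 1 to 5.

Example:
Question: What is the capital of France?
Answer: Paris is the capital of France.
Reasoning: The answer is correct, direct, and complete.
Score: 5

Now evaluate:
Question: {question}
Answer: {answer}
Reasoning:"""


def judge_with_cot(question: str, answer: str, n_samples: int = 3) -> float:
    """Few-shot, chain-of-thought judging with a simple consistency check:
    sample the judge several times and return the median parsed score."""
    prompt = FEW_SHOT_TEMPLATE.format(question=question, answer=answer)
    scores = []
    for _ in range(n_samples):
        reply = call_llm(prompt)
        match = re.search(r"Score:\s*(\d+)", reply)
        if match is not None:
            scores.append(int(match.group(1)))
    if not scores:
        raise ValueError("Judge never produced a parseable score.")
    return statistics.median(scores)
```

Sampling the judge more than once (or querying several distinct judge models and aggregating) is a common way to surface and dampen the scoring inconsistency that single-shot judges exhibit.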
- Example(s):
- Commercial LLM Judges, such as a GPT-4-based judge used in MT-Bench and Chatbot Arena evaluations.
- Open-Source LLM Judges, such as Prometheus, an open evaluator model trained for fine-grained, rubric-based scoring.
- Specialized LLM Judges, such as Llama Guard, an LLM-based safety classifier for harmful-content judgments.
- Evaluation Framework Judges, such as the G-Eval judge, which scores outputs using chain-of-thought prompting and a form-filling rubric.
- ...
- Counter-Example(s):
- Human Judge, which uses human evaluators rather than llm automated assessment.
- Rule-Based Evaluator, which applies deterministic rules rather than llm flexible judgment.
- Exact Match Scorer, which checks string equality rather than llm semantic evaluation.
- Statistical Metric, which calculates numerical measures rather than llm qualitative assessment.
- See: LLM Evaluation Method, Large-Scale Language Model (LLM), Pairwise Comparison, Direct Scoring, G-Eval, Evaluation Prompt, LLM Benchmark, Human Evaluation, LLM Safety Metric, LLM Application Evaluation Framework, LLM DevOps Framework.