LLM-based System Quality Evaluation Report


An LLM-based System Quality Evaluation Report is a specialized quality-focused LLM-based system evaluation report that can consolidate LLM-based system output quality assessments, LLM-based system generation accuracy measures, and LLM-based system response coherence analyses produced by LLM-based system quality evaluation tasks.
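
For illustration, a minimal sketch of how such a report might consolidate per-response quality scores into summary measures. The class and method names (QualityEvaluationReport, add_assessment, summarize) and the assumption that scores are normalized to [0, 1] are hypothetical, not a prescribed structure:

<pre>
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class QualityEvaluationReport:
    """Hypothetical report that consolidates per-response quality scores.

    Scores are assumed to be normalized to [0, 1] and keyed by metric
    name (e.g. "generation_accuracy", "response_coherence").
    """
    system_name: str
    scores: Dict[str, List[float]] = field(default_factory=dict)

    def add_assessment(self, metric: str, score: float) -> None:
        # Record one quality assessment (e.g. one graded response).
        self.scores.setdefault(metric, []).append(score)

    def summarize(self) -> Dict[str, float]:
        # Aggregate each metric into a mean score for the report.
        return {metric: mean(values) for metric, values in self.scores.items()}


# Usage: consolidate accuracy and coherence assessments for one LLM-based system.
report = QualityEvaluationReport(system_name="support-bot-v2")
report.add_assessment("generation_accuracy", 0.92)
report.add_assessment("generation_accuracy", 0.85)
report.add_assessment("response_coherence", 0.78)
print(report.summarize())  # {'generation_accuracy': 0.885, 'response_coherence': 0.78}
</pre>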


