2023 L-Eval: Instituting Standardized Evaluation for Long Context Language Models


Subject Headings: Long Context Language Model (LCLM).

Notes

Cited By

Quotes

Abstract

Recently, there has been growing interest in extending the context length of instruction-following models in order to effectively process single-turn long input (e.g. summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have demonstrated considerable advancements in handling tens of thousands of tokens of context, open-sourced models are still in the early stages of experimentation. It also remains unclear whether developing these long context models can offer substantial gains on practical downstream tasks over retrieval-based methods or models simply trained on chunked contexts. To address this challenge, we propose to institute standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 query-response pairs manually annotated and checked by the authors, encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind their commercial counterparts, they still exhibit impressive performance. LLaMA2 achieves the best results (winning 45% vs. turbo-16k) on open-ended tasks with only a 4k context length, and ChatGLM2 achieves the best results on closed-ended tasks with 8k input tokens. We release our new evaluation suite, code, and all generation results, including predictions from all open-sourced LCLMs, GPT-4-32k, and Claude-100k, at https://github.com/OpenLMLab/LEval.
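The repository above distributes the tasks themselves. As a rough illustration of how a suite of this shape can be consumed, the sketch below assumes each task file is JSONL with one long "input" document per record plus parallel "instructions"/"outputs" lists; these field names, and the model_fn wrapper, are illustrative assumptions rather than the suite's confirmed schema (consult the L-Eval repo for the exact format). Truncating the document via max_doc_chars loosely mimics the short-context baselines (e.g. a 4k-token model) that the abstract compares against.

import json

def load_leval_task(path):
    """Load one L-Eval-style task file: one JSON object per line.

    Assumed schema (illustrative): each record holds a long "input"
    document and parallel lists of "instructions" and reference
    "outputs".
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def evaluate(model_fn, records, max_doc_chars=None):
    """Run a model over every (document, instruction) pair.

    model_fn: callable(prompt: str) -> str, wrapping whatever long
    context model is under test. Set max_doc_chars to truncate the
    document and approximate a short-context baseline.
    """
    predictions = []
    for rec in records:
        doc = rec["input"]
        if max_doc_chars is not None:
            doc = doc[:max_doc_chars]
        for instruction, reference in zip(rec["instructions"], rec["outputs"]):
            prompt = f"{doc}\n\nInstruction: {instruction}\nAnswer:"
            predictions.append(
                {"prediction": model_fn(prompt), "reference": reference}
            )
    return predictions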

References


Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. (2023). "L-Eval: Instituting Standardized Evaluation for Long Context Language Models." doi:10.48550/arXiv.2307.11088