llm-jp-eval Benchmark


An llm-jp-eval Benchmark is a unilingual LLM benchmark (that evaluates the performance of large language models) on Japanese-language tasks.



References

2024

  • https://llm-jp.github.io/awesome-japanese-llm/README_en.html

2023

  • https://github.com/llm-jp/llm-jp-eval
    • NOTES: Here are seven key points about the llm-jp-eval benchmark:
      1. llm-jp-eval is a tool for automatically evaluating Japanese large language models across multiple datasets; it converts existing Japanese evaluation data into datasets for text-generation tasks.
      2. It supports running evaluations across multiple datasets and generating instruction-tuning data (jaster) in the same format as the evaluation prompts; data formats and supported datasets are detailed in the DATASET.md file.
      3. Pre-processing scripts are provided to download and prepare the evaluation and instruction datasets. Evaluation is run through a config file managed with Hydra (see the Hydra sketch after these notes).
      4. Evaluation results and outputs are saved as JSON and can be synced with Weights & Biases (W&B) for experiment tracking (see the W&B sketch after these notes). Options allow customizing model settings, dataset subsets, prompt templates, and generation parameters.
      5. New datasets can be added by creating a new class in the src/llm_jp_eval/datasets directory and updating a few other files (see the processor sketch after these notes). Datasets are split into train/dev/test sets following predefined, size-based criteria.
      6. The tool is distributed under the Apache License 2.0, with dataset licenses listed in DATASET.md. Contributors can report issues or submit pull requests to the dev branch.
      7. An important caveat is that models instruction-tuned on the jaster data can achieve very high llm-jp-eval scores even without using the test set for tuning, so a high score alone does not necessarily mean a model outperforms other LLMs.
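
The following is a minimal sketch of the Hydra-driven evaluation pattern described in note 3. It uses the standard hydra.main / OmegaConf API; the config keys (model_name, datasets, output_dir) and the entry-point name run_eval are hypothetical placeholders, not llm-jp-eval's actual configuration schema.

  # Minimal sketch of a Hydra-driven evaluation entry point (note 3).
  # The config keys below are hypothetical placeholders, not llm-jp-eval's schema.
  import json
  from pathlib import Path

  import hydra
  from omegaconf import DictConfig, OmegaConf


  @hydra.main(version_base=None, config_path="configs", config_name="config")
  def run_eval(cfg: DictConfig) -> None:
      # Print the resolved configuration so the run is reproducible.
      print(OmegaConf.to_yaml(cfg))

      # Placeholder for loading a model and scoring each configured dataset;
      # the real evaluation logic would live in the evaluation library.
      results = {"model": cfg.model_name,
                 "scores": {name: None for name in cfg.datasets}}

      # Save results as JSON, mirroring the JSON output described in note 4.
      out_dir = Path(cfg.output_dir)
      out_dir.mkdir(parents=True, exist_ok=True)
      (out_dir / "result.json").write_text(
          json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")


  if __name__ == "__main__":
      run_eval()

With Hydra, individual settings can be overridden on the command line (e.g. python run_eval.py model_name=my-model), which reflects the kind of configurable options mentioned in note 4.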
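
A small sketch of syncing a saved JSON result file to Weights & Biases, as in note 4. wandb.init and wandb.log are the standard W&B Python API; the file name, project name, and metric layout are assumptions for illustration.

  # Sketch: push per-dataset scores from a JSON result file to W&B (note 4).
  # File layout, project name, and metric names are illustrative assumptions.
  import json

  import wandb

  with open("result.json", encoding="utf-8") as f:
      results = json.load(f)

  run = wandb.init(project="llm-jp-eval-demo", name=results.get("model", "unknown"))
  for dataset_name, score in results.get("scores", {}).items():
      if score is not None:  # skip datasets that were not evaluated
          wandb.log({f"eval/{dataset_name}": score})
  run.finish()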
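
A hypothetical illustration of note 5: adding a dataset amounts to writing a small processor class that prepares the raw data and emits train/dev/test splits in a common text-generation format. The class name, method names, and split ratios below are assumptions, not llm-jp-eval's actual interface.

  # Hypothetical dataset processor (note 5): reads raw TSV lines and writes
  # train/dev/test JSON splits. Names and ratios are illustrative assumptions.
  import json
  from dataclasses import asdict, dataclass
  from pathlib import Path


  @dataclass
  class Sample:
      input: str   # prompt text shown to the model
      output: str  # gold answer used for scoring


  class MyNewDatasetProcessor:
      def __init__(self, raw_path: Path, out_dir: Path) -> None:
          self.raw_path = raw_path
          self.out_dir = out_dir

      def preprocess(self) -> None:
          samples = []
          for line in self.raw_path.read_text(encoding="utf-8").splitlines():
              question, answer = line.split("\t", maxsplit=1)
              samples.append(Sample(input=question, output=answer))

          # Size-based split, echoing the "predefined criteria based on size"
          # mentioned in the notes; the 80/10/10 ratios here are arbitrary.
          n = len(samples)
          splits = {"train": samples[: int(0.8 * n)],
                    "dev": samples[int(0.8 * n): int(0.9 * n)],
                    "test": samples[int(0.9 * n):]}

          self.out_dir.mkdir(parents=True, exist_ok=True)
          for name, split in splits.items():
              (self.out_dir / f"{name}.json").write_text(
                  json.dumps([asdict(s) for s in split], ensure_ascii=False, indent=2),
                  encoding="utf-8")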