Dolomites Benchmark
A Dolomites Benchmark is a Long-Form Methodical Writing Benchmarking Task that evaluates language models on realistic, domain-specific, long-form methodical writing tasks drawn from diverse professional fields.
- AKA DoLoMiTes, Domain-Specific Long-Form Methodical Tasks.
- Context:
- Task Input: Structured prompts detailing task objectives, procedures, and specific inputs relevant to the domain.
- Optional Input: Additional context or background information pertinent to the task.
- Task Output: Coherent, structured long-form text adhering to domain-specific conventions and requirements.
- Task Performance Measure/Metrics: Evaluated using reference-based methods (e.g., round-trip factual consistency) and reference-less methods (e.g., autorater judgments).
- Benchmark datasets (optional): DoLoMiTes dataset comprising 519 expert-authored tasks with 1,857 input-output examples across 25 fields.
- It can assess the capability of language models to perform complex, structured writing tasks that require domain expertise and methodical reasoning.
- It can include tasks such as drafting clinical assessments, educational lesson plans, and technical documentation, reflecting real-world professional writing scenarios.
- It can provide a typology of methodical tasks structured with objectives, procedures, inputs, and expected outputs, as sketched in the schema example after this list.
- It can serve as a resource for evaluating and improving the performance of AI systems in generating coherent and contextually appropriate long-form content.
- ...
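The task typology above can be made concrete with a short schema sketch. The following Python dataclasses are a minimal illustration assuming simplified, hypothetical field names (MethodicalTask, TaskExample, and all attributes are invented for this page); the actual Dolomites schema may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MethodicalTask:
    """One expert-authored task specification (objective, procedure, I/O)."""
    field_name: str             # e.g., "Medicine" or "Education"
    task_objective: str         # what the expert aims to produce
    task_procedure: List[str]   # ordered steps the expert follows
    input_sections: List[str]   # information the task consumes
    output_sections: List[str]  # required structure of the long-form output

@dataclass
class TaskExample:
    """One instantiation of a task: a concrete input plus reference output."""
    task: MethodicalTask
    example_input: str
    additional_context: Optional[str] = None  # the optional background input
    reference_output: Optional[str] = None    # withheld in the test split

# Hypothetical instantiation mirroring the clinical example below.
diagnosis_task = MethodicalTask(
    field_name="Medicine",
    task_objective="Draft a differential diagnosis report from patient data.",
    task_procedure=[
        "Review the patient's history and presenting symptoms.",
        "List candidate conditions consistent with the findings.",
        "Rank the candidates and justify the ordering.",
    ],
    input_sections=["Patient history", "Symptoms", "Lab results"],
    output_sections=["Candidate diagnoses", "Ranking and rationale"],
)
```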
- Example(s):
- Generating a differential diagnosis report based on patient data in the medical domain.
- Creating a comprehensive lesson plan for a specific educational topic.
- Drafting a detailed project proposal in an engineering context.
- ...
- Counter-Example(s):
- Short-form text generation tasks that do not require structured reasoning or domain-specific knowledge.
- Generic language benchmarks that assess basic language understanding without focusing on professional writing tasks.
- Creative writing prompts that prioritize imaginative storytelling over structured, factual content.
- ...
- See: Domain-Specific Natural Language Generation Task, Long-Form Text Generation, Benchmark Datasets for Language Models, Professional Writing Assistance.
References
2025a
- (Hugging Face, 2025) ⇒ Hugging Face Team (2025). "Dolomites Dataset". Retrieved: 2025-04-27.
- QUOTE: Dataset for evaluating language model performance on domain-specific methodical tasks.
2025b
- (Malaviya et al., 2025) ⇒ Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, and Chris Alberti. (2025). "Dolomites: Domain-Specific Long-Form Methodical Tasks". In: Transactions of the Association for Computational Linguistics.
- QUOTE: Domain-specific writing tasks require methodically generating structured long-form outputs through complex inferences that combine contextual understanding and domain knowledge.
Our benchmark contains 519 expert-authored task specifications across 25 fields, with 1,857 revised examples demonstrating real-world applications of methodical writing.
2024a
- (Malaviya et al., 2024) ⇒ Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, and Chris Alberti. (2024). "Dolomites: Domain-Specific Long-Form Methodical Tasks".
- QUOTE: Methodical tasks follow a structured typology with task objective, procedure, input, and output specifications.
Expert validation revealed 85% structural compliance in model-generated examples, though depth enhancement through human revision increased text complexity from 11.69 to 13.46 Flesch-Kincaid grade level.
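The Flesch-Kincaid grade levels quoted above can be reproduced for any candidate output with standard readability tooling. A minimal sketch, assuming the third-party textstat package (the paper does not specify its tooling):

```python
# Flesch-Kincaid grade level:
#   0.39 * (total words / total sentences)
#   + 11.8 * (total syllables / total words) - 15.59
# `textstat` is an assumption; any readability library would do.
import textstat

draft = ("The patient presents with acute dyspnea and pleuritic chest pain. "
         "Differential considerations include pulmonary embolism and pneumonia.")
print(textstat.flesch_kincaid_grade(draft))  # higher = more complex prose
```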
2024b
- (Google DeepMind, 2024) ⇒ Google DeepMind. (2024). "DoLoMiTes: Domain-Specific Long-Form Methodical Tasks".
- QUOTE: The benchmark dataset contains 519 expert-curated task descriptions and 1,857 instantiated examples across 25 domains.
Data structure includes development set (830 examples with reference outputs) and test set (1,037 examples without references), supporting rigorous evaluation of long-form generation models.
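A minimal sketch of how such splits might be loaded with the Hugging Face datasets library; the dataset identifier, split names, and field names below are assumptions for illustration, not the official release layout:

```python
from datasets import load_dataset

# "your-org/dolomites" is a placeholder identifier; consult the official
# Dolomites release for the actual hosting location and schema.
splits = load_dataset("your-org/dolomites")

dev = splits["validation"]  # per the quote: 830 examples with reference outputs
test = splits["test"]       # per the quote: 1,037 examples without references

for row in dev.select(range(2)):
    # Field names are assumptions based on the task typology
    # (objective, procedure, input, output); actual names may differ.
    print(row.get("task_objective", ""), "->", len(row.get("output", "")))
```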
2024c
- (Dolomites Benchmark, 2024) ⇒ Dolomites Benchmark Team. (2024). "Dolomites: Domain-Specific Long-Form Methodical Tasks".
- QUOTE: This benchmark evaluates language models on realistic writing tasks requiring domain expertise, such as medical diagnosis drafting and educational lesson planning.
Task instantiations combine web-derived context with model-generated content refined through iterative expert editing.