Instruction-Following Dataset

From GM-RKB

An Instruction-Following Dataset is a dataset that contains instructions for tasks and the corresponding outputs or behaviors that should result from following those instructions.

  • Context:
    • It can (typically) be used to train or fine-tune models to understand and execute tasks based on explicit instructions.
    • It can (typically) consist of pairs of instructions and their corresponding correct outputs or actions.
    • It can (typically) be used in scenarios where it's crucial for the model to interpret and act upon user-given instructions accurately.
    • It can be employed to enhance the ability of models, especially LLMs, to generalize across a range of tasks presented as instructions.
    • It can be derived from real-world scenarios, user interactions, or can be synthetically generated.
    • ...
  • Example(s):
    • the Alpaca instruction dataset (52k instruction-output pairs generated via self-instruct).
    • the databricks-dolly-15k dataset (human-written instruction-response pairs).
    • the FLAN Collection (a large collection of tasks reformulated as instructions).
    • ...
  • Counter-Example(s):
    • A general text dataset.
    • An image dataset used for object recognition, without any instruction-based context.
  • See: Instruction-Tuned LLM, Fine-Tuning, Task-Oriented Datasets, Supervised Learning.
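The instruction-output pair structure described above can be sketched in code. The following is a minimal, illustrative example using the common Alpaca-style record schema (`instruction` / `input` / `output`); the records themselves are made up for illustration and are not drawn from any real dataset.

```python
import json

# Illustrative instruction-following records in the Alpaca-style schema.
records = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "Instruction tuning trains language models on (instruction, output) pairs.",
        "output": "Instruction tuning teaches models to follow explicit task instructions.",
    },
    {
        "instruction": "Translate 'good morning' into French.",
        "input": "",  # instructions may have no accompanying input
        "output": "Bonjour.",
    },
]

def to_training_text(record):
    """Flatten one record into a single prompt/response training string."""
    prompt = record["instruction"]
    if record["input"]:
        prompt += "\n" + record["input"]
    return prompt + "\n### Response:\n" + record["output"]

# Serialize as JSON Lines, a common on-disk format for such datasets.
jsonl = "\n".join(json.dumps(r) for r in records)
```

A fine-tuning pipeline would typically flatten each record into a single training string (as `to_training_text` does) before tokenization.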


References

2023

  • https://github.com/yaodongC/awesome-instruction-dataset
    • A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca). We currently include three types of datasets:
      • visual-instruction-tuning (e.g. image-instruction-answer)
      • text-instruction-tuning datasets.
      • red-teaming | Reinforcement Learning from Human Feedback (RLHF) Datasets
    • Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.
    • Lists of codebases to train your LLMs:
      • nichtdax/awesome-totally-open-chatgpt: A codebase of totally open alternatives to ChatGPT
    • Size: the number of instruction-tuning pairs in each dataset
    • Lingual-Tags:
      • EN: Instruction datasets in English
      • CN: Instruction datasets in Chinese
      • ML: [Multi-lingual] Instruction datasets in multiple languages
    • Task-Tags:
      • MT: [Multi-task] Datasets containing multiple tasks
      • TS: [Task-specific] Datasets tailored for specific tasks
    • Generation-method:
      • HG: [Human Generated Dataset] Datasets created by humans
      • SI: [Self-Instruct] Datasets generated using self-instruct methods
      • MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
      • COL: [Collection of Datasets] Datasets made from a collection of other datasets

2023

  • GBard
    • An instruction-following dataset is a dataset that contains pairs of instructions and their corresponding outputs. The instructions are typically given in natural language, and the outputs can be text, code, or images.
    • Instruction-following datasets are used to train large language models (LLMs) to follow instructions. This is challenging because the model must understand the meaning of the instruction and then generate the appropriate output.
    • Instruction-following datasets can be created in a variety of ways. One common approach is to collect pairs of instructions and outputs from human users. This can be done by asking users to complete tasks, such as writing code, summarizing text, or generating images, based on given instructions.
    • Another approach to creating instruction-following datasets is to use synthetic data. Synthetic data can be generated by using existing LLMs to follow instructions and generate outputs. This approach can be used to create large and diverse datasets without the need to collect data from human users.
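The synthetic, self-instruct-style generation described above can be sketched as a simple loop: seed instructions are fed to a generator model, which proposes new instructions and answers them. In this sketch the model call is a stub (`mock_llm` is hypothetical); a real pipeline would call an actual LLM API and filter low-quality generations.

```python
import random

# Seed instructions that bootstrap the generation loop (illustrative).
seed_instructions = [
    "Write a haiku about autumn.",
    "Explain recursion to a child.",
]

def mock_llm(prompt):
    # Hypothetical stand-in for a real model call.
    return f"[model response to: {prompt}]"

def self_instruct_step(seeds, n_new=2):
    """Generate new (instruction, output) pairs from seed instructions."""
    pairs = []
    for _ in range(n_new):
        seed = random.choice(seeds)
        # Ask the model for a new task similar to an existing seed...
        new_instruction = mock_llm(f"Write a new task similar to: {seed}")
        # ...then ask it to answer its own generated instruction.
        output = mock_llm(new_instruction)
        pairs.append({"instruction": new_instruction, "output": output})
    return pairs

dataset = self_instruct_step(seed_instructions)
```

In practice, generated pairs are deduplicated and filtered before being added back to the seed pool, which lets the dataset grow without further human labeling.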