Finetuned Large Language Model (LLM)


A Finetuned Large Language Model (LLM) is a pre-trained LLM whose parameters have been further trained (finetuned) on task-specific, instruction, or otherwise curated data.
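
As a minimal illustration only (not drawn from any of the cited works), the sketch below finetunes a small pre-trained causal LM on task-specific text using the Hugging Face transformers and datasets libraries; the checkpoint name "gpt2", the file "task_corpus.txt", and all hyperparameters are placeholder assumptions.

# Minimal sketch: turn a pre-trained causal LM into a finetuned LLM on task text.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "gpt2"  # placeholder; any pre-trained causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "task_corpus.txt"})["train"]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = enc["input_ids"].copy()  # causal LM objective on the task text
    return enc

train_data = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-llm", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=train_data,
)
trainer.train()  # the resulting checkpoint is a finetuned LLM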



References

2022

  • (Chung et al., 2022) ⇒ Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, et al. (2022). “Scaling Instruction-finetuned Language Models.” arXiv preprint arXiv:2210.11416
    • ABSTRACT: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
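
For illustration of the instruction-finetuning setup described in this abstract (not the authors' Flan pipeline or data mixture), the sketch below finetunes a small sequence-to-sequence model on a single supervised example rephrased as a natural-language instruction; the "t5-small" checkpoint, the prompt wording, and the learning rate are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # stand-in for Flan-T5 / PaLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# One supervised example rephrased as an instruction; the paper scales this idea to ~1.8K tasks.
instruction = "Answer the question. Question: What is the capital of France? Answer:"
target = "Paris"

inputs = tokenizer(instruction, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**inputs, labels=labels).loss   # standard cross-entropy on the instruction's target
loss.backward()
optimizer.step()
optimizer.zero_grad()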

2020

  • (Gunel et al., 2020) ⇒ Beliz Gunel, Jingfei Du, Alexis Conneau, and Ves Stoyanov. (2020). “Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning.” arXiv preprint arXiv:2011.01403
    • ABSTRACT: State-of-the-art natural language understanding classification models follow two-stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architecture, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and can generalize better to related tasks with limited labeled data.
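
The following rough PyTorch sketch shows a supervised contrastive (SCL) term combined with cross-entropy in the spirit of this abstract; the temperature, the weighting term lam, and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.3):
    # features: (batch, dim) sentence representations; labels: (batch,) class ids
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                  # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float("-inf"))         # exclude the anchor itself
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                     # skip anchors with no positives
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1)
    return (loss[valid] / pos_counts[valid]).mean()

def combined_loss(logits, features, labels, lam=0.5):
    # Cross-entropy plus the SCL term, weighted by lam, as described in the abstract.
    return (1 - lam) * F.cross_entropy(logits, labels) + lam * supervised_contrastive_loss(features, labels)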

2019

  • (Lee et al., 2019) ⇒ Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. (2019). “Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models.” arXiv preprint arXiv:1909.11299
    • ABSTRACT: In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.
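
The following simplified PyTorch sketch illustrates the mixout idea described above: with probability p each finetuned parameter is swapped back to its pretrained value, and the result is rescaled so its expectation matches the finetuned value (analogous to inverted dropout). It is an in-place, between-step illustration, not the layer-level implementation released by the authors.

import torch

def mixout_(model, pretrained_state, p=0.1):
    # In-place mix of current parameters with their pretrained counterparts.
    with torch.no_grad():
        for name, param in model.named_parameters():
            pretrained = pretrained_state[name].to(param.device)
            keep = torch.bernoulli(torch.full_like(param, 1.0 - p))  # 1 = keep finetuned value
            mixed = keep * param + (1.0 - keep) * pretrained
            param.copy_((mixed - p * pretrained) / (1.0 - p))        # rescale for unbiasedness

# Usage sketch: snapshot the pretrained weights once, then apply after each optimizer step.
# pretrained_state = {k: v.clone() for k, v in model.state_dict().items()}
# mixout_(model, pretrained_state, p=0.1)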