2023 TuningLanguageModelsAsTrainingDataGeneratorsForAugmentationEnhancedFewShotLearning

From GM-RKB
(Redirected from Meng et al., 2023)

Subject Headings:

Notes

  • It introduces a method called FewGen, which first tunes an autoregressive PLM on few-shot samples, then uses it to generate a large amount of novel training samples. These samples augment the original training set for better classification model fine-tuning. This approach could be highly relevant for contract agreement review, as it allows for the generation of diverse training examples from a limited set of annotated contract samples.
  • It employs a meta-learning strategy for weighted maximum likelihood training, automatically adjusting the weight of each token based on its discriminative importance. This is particularly useful in a contract review setting where certain clauses or phrases are more critical for risk assessment.
  • It demonstrates that FewGen achieves better results across seven classification tasks of the GLUE benchmark than existing few-shot learning methods. This implies that it could effectively handle various contract types and identify risk elements with minimal initial data.
  • It emphasizes the generation of label-discriminative training samples, which is crucial in the contract review context where differentiating between risk and non-risk elements is key.
  • It incorporates techniques like label smoothing and temporal ensembling in the fine-tuning process of classification models to mitigate label noise. This could ensure that the risk annotations in contract review are accurate and reliable.
  • It acknowledges potential ethical concerns with PLM-generated text, such as disinformation and biases, which are important considerations in legal contexts like contract reviews.
  • It recognizes that FewGen, while effective, requires more computational resources and time compared to direct few-shot learning methods. This might be a consideration in terms of the practical deployment of such a system in a business environment.

Cited By

Quotes

Abstract

Recent studies have revealed the intriguing few-shot learning ability of pretrained language models (PLMs): They can quickly adapt to a new task when fine-tuned on a small amount of labeled data formulated as prompts, without requiring abundant task-specific annotations. Despite their promising performance, most existing few-shot approaches that only learn from the small training set still underperform fully supervised training by nontrivial margins. In this work, we study few-shot learning with PLMs from a different perspective: We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples which augment the original training set. To encourage the generator to produce label-discriminative samples, we train it via weighted maximum likelihood where the weight of each token is automatically adjusted based on a discriminative meta-learning objective. A classification PLM can then be fine-tuned on both the few-shot and the synthetic samples with regularization for better generalization and stability. Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods, improving no-augmentation methods by 5+ average points, and outperforming augmentation methods by 3+ average points.

1. Introduction

Recent research has demonstrated the appealing few-shot learning potential of pretrained language models (PLMs) (Brown et al., 2020; Clark et al., 2020; Devlin et al., 2019; He et al., 2021; Liu et al., 2019; Meng et al., 2021a; 2022b) on natural language understanding (NLU) tasks (Wang et al., 2019; 2018): Instead of relying on abundant task-specific annotations, PLMs can effectively leverage a small set of training samples to quickly learn a new task. Such training data efficiency is usually achieved by formulating downstream tasks as prompts (Brown et al., 2020; Gao et al., 2021; Scao & Rush, 2021; Schick & Schütze, 2021a;d), allowing the PLM to adapt its language modeling ability acquired through pretraining to downstream tasks.

The success of prompt-based methods has stimulated numerous explorations along the line of effective few-shot learning with PLMs: The training samples converted to natural language prompts can be used to directly fine-tune PLMs (Gao et al., 2021; Schick & Schütze, 2021a) or as in-context demonstrations to facilitate better inference (Liu et al., 2022b; Min et al., 2022b). Recent approaches aim to automate the design of prompts by gradient-based searching (Shin et al., 2020) or by parameterizing prompts as continuous learnable embeddings (Lester et al., 2021; Zhang et al., 2022; Zhong et al., 2021). Other studies investigate and address specific issues in prompt-based few-shot learning (Liu et al., 2022a; Tam et al., 2021; Zhao et al., 2021). While remarkable, the model performance still has a nontrivial gap from fully supervised models trained on massive labeled data. Indeed, training deep models is inherently data demanding—model generalization usually benefits from more training samples (Baum & Haussler, 1988).

In this work, we study few-shot learning with PLMs from a different perspective: Instead of proposing new methods for fine-tuning on few-shot samples, we focus on the generation of quality training data based on few-shot samples and using these synthesized training samples to fine-tune the classification models. Motivated by the strong text generation power of autoregressive PLMs (Brown et al., 2020; Keskar et al., 2019; Raffel et al., 2019), a few previous studies enlarge the training set by generating new texts as training samples. They either fine-tune the generator on the initial training set with the standard maximum likelihood objective (Anaby-Tavor et al., 2020; Kumar et al., 2020) or use the training samples as demonstrations (Yoo et al., 2021). However, these methods do not explicitly model the distinction across different labels and may struggle to generate accurate training samples pertaining to the desired labels for challenging NLU tasks.

In this paper, we explore how to effectively use few-shot samples to tune PLMs for generating high-quality, label-discriminative training samples. Our contributions are as follows: (1) We analyze the issues of using standard maximum likelihood for tuning the generator and propose a meta-weighted maximum likelihood objective that automatically learns token weights emphasizing label discriminativeness. (2) We propose a simple and effective training procedure for fine-tuning classification PLMs on generated data by mitigating label noise. (3) Under the same few-shot learning setting, our method FewGen outperforms existing methods by 3+ average points on seven classification tasks of the GLUE benchmark (Wang et al., 2018). Ablation studies validate the effectiveness of our proposed meta-weighted training objective and classifier fine-tuning method. [1]

2. Related Work

Few-Shot Learning with PLMs. Few-shot learning has gained much attention recently due to its minimal resource assumption—without requiring massive annotated data but only leveraging a few training samples (e.g., 16 per label), few-shot methods can be widely adopted in many practical scenarios where obtaining large-scale annotations is unaffordable. Standard fine-tuning of PLMs for few-shot learning usually performs poorly because the limited training samples may not be sufficient for optimizing the parameters in the newly introduced classification head. To reuse the language modeling ability of PLMs without introducing randomly initialized parameters, prompt-based approaches (Brown et al., 2020; Gao et al., 2021; Hu et al., 2022; Logan IV et al., 2021; Min et al., 2022a; Schick & Schütze, 2021a;b;d; Tam et al., 2021) formulate training samples as natural language prompt templates so that various downstream tasks can be solved as a token prediction problem. They enjoy improved training data efficiency over standard fine-tuning in low-data regimes (Scao & Rush, 2021) and achieve remarkable few-shot learning performance. Later developments in prompt-based methods replace the manual design of prompt templates with automatic search or learning (Cui et al., 2022; Hambardzumyan et al., 2021; Lester et al., 2021; Liu et al., 2021b; Zhang et al., 2022; Zhong et al., 2021). There are also studies focusing on specific issues (Liu et al., 2022a; Tam et al., 2021; Zhao et al., 2021) in prompt-based methods. Instead of proposing fine-tuning methods for few-shot learning, we study how to generate quality training samples as augmentations by learning from the few-shot samples.

[1] Code can be found at https://github.com/yumeng5/FewGen.

Data Augmentation. Data augmentation methods (Chen et al., 2020; Huang et al., 2022; Lee et al., 2021; Meng et al., 2021b; Miyato et al., 2017; Xie et al., 2020) aim to create samples similar to the existing ones so that the enlarged training set can benefit model generalization. Early approaches simply use manually designed rules (e.g., swapping or inserting tokens) for word-level alterations over the given samples to create new ones (Wei & Zou, 2019). Later methods leverage the strong generation power of PLMs to synthesize novel samples from scratch. Given a training set, the PLMs can either be fine-tuned on the labeled samples to learn a label-conditioned generation probability (Kumar et al., 2020; Lee et al., 2021; Yang et al., 2020) or take the labeled data as demonstrations (Wang et al., 2021; Yoo et al., 2021) to generate similar samples pertaining to the same label. In this work, we study how to effectively tune generators on few-shot training data for creating new data—standard fine-tuning of PLMs on a small set of training data is prone to overfitting, and the resulting model may struggle to generate accurate, diverse and novel training data. We address this challenge by leveraging prefix-tuning and proposing a new meta-weighted generator tuning objective that emphasizes label-distinctive tokens.

Controlled Text Generation. Generating training samples for different labels can be viewed as a form of controlled text generation (Hu et al., 2017), whose goal is to generate textual content of desired semantics, styles or attributes. Such control can be realized through different stages of PLM training and deployment: During pretraining, control codes (Keskar et al., 2019) can be used as explicit guidance for training the model to generate domain/attribute-specific texts; fine-tuning PLMs with attribute-specific data can also grant high-level control (e.g., certain topics or sentiments (Ziegler et al., 2019)), fine-grained control (e.g., specific words or phrases (Chan et al., 2021)) or both (Khalifa et al., 2021); at inference time, control over desired attributes can also be enforced without updating the PLM parameters (Dathathri et al., 2020; Krause et al., 2021; Kumar et al., 2021; Liu et al., 2021a; Pascual et al., 2021; Yang & Klein, 2021). More specifically related to the idea of generating training data with language models, early methods in text classification use bag-of-words or LSTM-based language models (Meng et al., 2018; 2019) to generate class-conditioned texts as training data. Recently, a few studies explore fine-tuning autoregressive PLMs (Anaby-Tavor et al., 2020; Yang et al., 2020) with the standard language modeling objective on the training set or using label-specific prompts (Gao et al., 2023; Meng et al., 2022a; Schick & Schütze, 2021c; Wang et al., 2021; Ye et al., 2022) to steer text generation towards the desired label. In this work, we analyze issues with directly tuning PLMs on few-shot samples with the standard maximum likelihood objective and propose a weighted variant of the objective that encourages the PLM to focus on label-discriminative tokens.

Meta-Learning for Sample Weighting. The idea of weighting training samples in the loss calculation originates from the class imbalance (Wang et al., 2017) and noisy label (Hendrycks et al., 2018) learning scenarios—by assigning higher weights to the samples from minority classes or lower weights to the noisy samples, the learning process is less impacted by the imbalance/label noise issues. Meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Franceschi et al., 2018; Wu et al., 2018) is one way to automatically learn the weight for each sample. Specifically, a meta objective, usually defined as the loss on a clean unbiased validation set (Ren et al., 2018; Shu et al., 2019), can be used to learn the sample weights, which become hyperparameters that control the optimization of model parameters. Our work has a different motivation and formulation of the meta objective for token-wise weighted training: Not all tokens in a training sample are equally label-discriminative. We thus design a meta objective that emphasizes distinction across different labels (instead of using the validation loss as the meta objective) for learning the token weights.

3. Method

3.1. Preliminaries

Overview. We consider the strict few-shot learning setting (Perez et al., 2021): The training set D_train = {(x, y)_i} consists of K training samples per label, where x = [x_1, x_2, ..., x_n] is a text sequence with n tokens. The development set D_dev is of the same size as D_train. There is no access to additional task-specific unlabeled data. The number of training samples K is assumed to be very small (e.g., K = 16), making it challenging to train a classification model C_φ that generalizes well to unseen data. To mitigate the training data scarcity issue, we first train an autoregressive PLM on D_train, and then use it as a generator G_θ to synthesize more novel samples D_gen = {(x̃, ỹ)_i} that augment the original training set. Finally, a classification PLM C_φ is fine-tuned on both D_train and D_gen to perform the task. An overview of FewGen is shown in Fig. 1.

Figure 1: Overview of FewGen. A generator PLM is first tuned on the few-shot samples with our proposed meta-weighted training objective and then used to synthesize new training samples. A classification PLM is finally trained on both the few-shot and the generated samples.

Text Generation with Autoregressive PLMs. In standard fine-tuning for text generation, an autoregressive PLM G_θ is trained via the maximum likelihood generation loss of each token in a sequence x conditioned on previous tokens:

$$\min_{\theta}\; -\frac{1}{n}\sum_{j=1}^{n} \log p_{\theta}(x_j \mid x_{<j}), \qquad p_{\theta}(x_j \mid x_{<j}) = \frac{\exp(e_{x_j}^{\top} h_j)}{\sum_{j'=1}^{|V|} \exp(e_{j'}^{\top} h_j)},$$

where the token generation probability p_θ(·) is usually parameterized using token embeddings e and hidden states h of a Transformer (Vaswani et al., 2017) model. After training, G_θ can be used to generate novel texts by iteratively sampling tokens from its generation probability distribution.

Prefix-Tuning. Unlike fine-tuning, which updates all model parameters θ of a PLM, prefix-tuning (Li & Liang, 2021) freezes all pretrained Transformer parameters and only optimizes prefix vectors θ_p that are prepended to each Transformer layer. We use prefix-tuning for training G_θp on D_train because (1) it offers better effectiveness than fine-tuning for small datasets (Li & Liang, 2021) and (2) the generation models for different labels can share the same backbone Transformer parameters with only the prefix vectors being different, significantly reducing the memory requirement for multi-class classification tasks.

3.2. Label-Discriminative Text Generator Tuning

Motivation. To model the conditional text generation probability p(x | y_l) on different labels, a straightforward way is to parameterize a generation model G_{θ_p^l} for each label y_l via a set of prefix vectors θ_p = {θ_p^l}, l = 1, ..., |L|, so that p(x | y_l) = p_{θ_p^l}(x), and then tune θ_p^l on the training samples x with label y_l:

$$\min_{\theta_p^l} \mathcal{L}_{\text{gen}}, \qquad \mathcal{L}_{\text{gen}}(\theta_p^l) = -\frac{1}{n}\sum_{j=1}^{n} \log p_{\theta_p^l}(x_j \mid x_{<j}). \qquad (1)$$
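For concreteness, here is a minimal sketch of the label-conditioned tuning in Eq. (1); it is not the authors' released implementation. GPT-2 stands in for CTRL, and the per-label "prefix" is simplified to trainable virtual-token embeddings prepended at the input layer (rather than prefix vectors at every Transformer layer); all names and hyperparameters below are illustrative.

```python
# Sketch only: label-conditioned generator tuning with the standard MLE objective (Eq. 1).
# The backbone causal LM is frozen; only the per-label prefix embeddings are trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
for p in lm.parameters():                       # freeze all pretrained parameters
    p.requires_grad_(False)

num_labels, prefix_len, d = 2, 10, lm.config.n_embd
prefixes = torch.nn.Parameter(0.02 * torch.randn(num_labels, prefix_len, d))
optimizer = torch.optim.AdamW([prefixes], lr=5e-4)

def gen_loss(text, label):
    ids = tok(text, return_tensors="pt").input_ids                       # [1, T]
    tok_emb = lm.get_input_embeddings()(ids)                             # [1, T, d]
    inputs = torch.cat([prefixes[label].unsqueeze(0), tok_emb], dim=1)
    labels = torch.cat([torch.full((1, prefix_len), -100), ids], dim=1)  # ignore prefix positions in the loss
    return lm(inputs_embeds=inputs, labels=labels).loss                  # mean token NLL, as in Eq. (1)

loss = gen_loss("a movie where the ending feels like a revelation .", label=1)  # e.g., label 1 = positive
loss.backward()
optimizer.step()
optimizer.zero_grad()
```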

However, such an approach only optimizes the generative likelihood p(x | y_l) without accounting for label discriminativeness p(y_l | x), which is essential for generating unambiguous training samples that benefit the final classification task. Challenging NLU tasks can have largely similar distributions across different labels, with very nuanced differences reflected by a few key tokens. For example, a negative review text “a movie where the ending feels like a cop-out” may immediately become a positive one by just changing the last word “cop-out” to “revelation”. Indeed, we find that such subtle distinctions over different labels may not be effectively captured using the standard generation objective in Eq. (1), where each token contributes equally to the overall loss. As shown in Fig. 2, a discriminative loss L_disc (defined in Eq. (2)) can even increase during training—it is possible that the dominating patterns in the training samples are label-indiscriminate (e.g., a movie review dataset may frequently mention “the movie”), making the generators of different labels eventually converge to similar distributions, especially when there are limited training samples per label.

Figure 2: (On MNLI) Training the generator via L_gen does not automatically decrease L_disc.

To promote the generation of label-discriminative texts, we encourage each token x_j to be more likely generated under the corresponding label y_l than under other labels (i.e., maximize p_{θ_p^l}(x_j | x_<j) and minimize p_{θ_p^l'}(x_j | x_<j) for l' ≠ l) via a discriminative loss L_disc:

$$\mathcal{L}_{\text{disc}}(\theta_p) = -\frac{1}{n}\sum_{j=1}^{n} \mathcal{L}_{\text{disc}}^{j}(\theta_p), \qquad \mathcal{L}_{\text{disc}}^{j}(\theta_p) = \log\frac{p_{\theta_p^{l}}(x_j \mid x_{<j})}{\sum_{l'=1}^{|L|} p_{\theta_p^{l'}}(x_j \mid x_{<j})}. \qquad (2)$$

Although one can directly combine L_disc with L_gen to train G_θp to enforce distinction across different labels, doing so will result in two undesirable consequences: (1) a hyperparameter needs to be introduced to balance the weights of the two losses, whose optimal value is likely to vary by task; and (2) directly updating generator parameters with the discriminative loss L_disc will worsen the language modeling quality of the generator, making it prone to generating less fluent and coherent texts after training.

Weighted Maximum Likelihood Generator Tuning. To preserve the generative learning of G_θp while emphasizing label-discriminative tokens, we assume each token is associated with a weight in the maximum likelihood loss. Intuitively, when our goal is to generate distinctive texts across different labels as training samples, not all tokens should contribute equally to generator training. For example, for sentiment classification tasks, one would expect “good/bad” to be more label-discriminative than “the movie”, and the former should be paid more attention to during training. It is thus natural to generalize L_gen in Eq. (1) to L_w-gen as follows by assuming a weight w_j is given for each token:

$$\min_{\theta_p^l} \mathcal{L}_{\text{w-gen}}, \qquad \mathcal{L}_{\text{w-gen}}(\theta_p^l; w) = -\frac{1}{n}\sum_{j=1}^{n} w_j\, \mathcal{L}_{\text{gen}}^{j}(\theta_p^l), \qquad \mathcal{L}_{\text{gen}}^{j}(\theta_p^l) = \log p_{\theta_p^l}(x_j \mid x_{<j}). \qquad (3)$$

Note that in L_w-gen, w is assumed to be a hyperparameter under which θ_p^l is optimized. When w_j is the same for every token, Eq. (3) is equivalent to Eq. (1). While it is possible to manually design weighting rules for setting w to promote label-discriminative learning, they would likely necessitate task-specific knowledge and nontrivial tuning. We therefore learn the token weights automatically using the idea of meta-learning.
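A minimal sketch of the token-weighted objective in Eq. (3) follows; it reflects our reading of the equation rather than the released code. The weights would come from the weighting network g_ω introduced below, but here they are simply passed in.

```python
# Sketch only: weighted maximum likelihood over tokens (Eq. 3).
import torch
import torch.nn.functional as F

def weighted_gen_loss(logits, target_ids, weights):
    """logits: [T, V] next-token logits; target_ids: [T]; weights: [T] token weights w_j."""
    token_logp = F.log_softmax(logits, dim=-1).gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -(weights * token_logp).mean()   # equal weights recover the standard MLE loss of Eq. (1)

# toy usage
T, V = 5, 100
logits = torch.randn(T, V, requires_grad=True)
loss = weighted_gen_loss(logits, torch.randint(V, (T,)), torch.rand(T))
loss.backward()
```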

Meta Weight Learning Setup. To automatically learn token weights as hyperparameters, we formulate a bi-level optimization problem using the idea of meta-learning. The inner objective L_w-gen optimizes the generator parameters θ_p given the token weights w_j(ω):

$$\theta_p^{*}(\omega) = \arg\min_{\theta_p} \mathcal{L}_{\text{w-gen}}\big(\theta_p; w(\omega)\big) = \arg\min_{\theta_p} -\frac{1}{n}\sum_{j=1}^{n} w_j(\omega)\, \mathcal{L}_{\text{gen}}^{j}(\theta_p),$$

where the token weights w_j(ω) are parameterized and learned via a weighting network g_ω (details about its implementation are in Appendix A). The weighting network parameters ω are trained with an outer objective L_disc:

$$\omega^{*} = \arg\min_{\omega} \mathcal{L}_{\text{disc}}\big(\theta_p^{*}(\omega)\big), \qquad \mathcal{L}_{\text{disc}}\big(\theta_p^{*}(\omega)\big) = -\frac{1}{n}\sum_{j=1}^{n} \mathcal{L}_{\text{disc}}^{j}\big(\theta_p^{*}(\omega)\big).$$

Under the above bi-level optimization formulation, the discriminative loss L_disc is not used to directly update generator parameters, but to automatically learn token weights that are used as hyperparameters by the inner objective L_w-gen. As the token weights are trained to minimize L_disc, the generator focuses more on label-discriminative tokens. We use an online optimization strategy (Shu et al., 2019) instead of nested optimization loops to optimize ω* and θ_p* for training efficiency. It also guarantees convergence to the critical points of both L_w-gen and L_disc under mild conditions. We initialize the prefix parameters θ_p using natural language prompts, and the details can be found in Appendix B. The overall training procedure is shown in Algorithm 1.

Algorithm 1: Meta-Weighted Generator Tuning
  Input: D_train: few-shot training set.
  Parameter: T: number of training steps.
  Output: θ_p: prefix parameters for all labels.
  Initialize θ_p^(0) (with task-descriptive prompts) and ω^(0)
  for t in [0, 1, ..., T − 1] do
    B ← sample a minibatch from D_train
    θ̂_p^(t)(ω^(t)) ← take one gradient step to descend L_w-gen(θ_p^(t); ω^(t)) on B
    ω^(t+1) ← take one gradient step to descend L_disc(θ̂_p^(t)(ω^(t))) on B
    θ_p^(t+1) ← take one gradient step to descend L_w-gen(θ_p^(t); ω^(t+1)) on B
  end for
  return θ_p = θ_p^(T)

Analysis of Meta Weight Learning. To study how the token weights are learned during training, we analyze the gradients of the weighting network parameters ω, which are optimized via Eq. (4) (detailed derivation in Appendix C):

$$\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p^{(t)}(\omega)\big)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}} \propto -\frac{1}{n}\sum_{j=1}^{n} d_j\,\frac{\partial w_j(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}, \qquad d_j = \frac{\partial \mathcal{L}_{\text{disc}}(\theta_p)}{\partial \theta_p}\bigg|_{\theta_p=\hat{\theta}_p^{(t)}}^{\top} \frac{\partial \mathcal{L}_{\text{gen}}^{j}(\theta_p)}{\partial \theta_p}\bigg|_{\theta_p=\theta_p^{(t)}}. \qquad (4)$$

It can be seen that the gradient descent direction of ω is determined by a sum of token weight gradient ascent directions (i.e., ∂w_j(ω)/∂ω) weighted by a scalar d_j, where d_j characterizes the similarity between the gradient of the discriminative objective and the gradient of the generative objective on the jth token. Therefore, the meta weights will be higher on tokens whose generative gradients are more beneficial for minimizing the discriminative objective, so that label-distinctive information is better emphasized.
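The following toy sketch illustrates one online meta-weighting step in the spirit of Algorithm 1; it is our own simplification on toy tensors, not the paper's implementation. The lookahead update of the prefix parameters is kept differentiable with respect to ω so that descending L_disc at the lookahead parameters yields the meta-gradient for ω.

```python
# Sketch only: one online bi-level update (lookahead on L_w-gen, meta step on L_disc).
import torch

torch.manual_seed(0)
V, d, L = 50, 8, 2                      # toy vocab size, hidden size, number of labels
E = torch.randn(V, d)                   # frozen token embeddings (stand-in for the PLM)
theta = [torch.randn(d, requires_grad=True) for _ in range(L)]   # per-label "prefix" parameters
omega = torch.zeros(d, requires_grad=True)                       # weighting-network parameters
lr_theta, lr_omega = 0.1, 0.1

def token_logprobs(theta_l, ids):
    # toy label-conditioned LM: token log-probabilities depend only on the prefix parameters
    return torch.log_softmax(E @ theta_l, dim=-1)[ids]            # [T]

def meta_step(ids, label):
    weights = torch.sigmoid(E[ids] @ omega)                       # w_j(omega), one weight per token
    # inner lookahead step on L_w-gen, kept differentiable w.r.t. omega
    lw = -(weights * token_logprobs(theta[label], ids)).mean()
    grad_theta = torch.autograd.grad(lw, theta[label], create_graph=True)[0]
    theta_hat = theta[label] - lr_theta * grad_theta
    # outer step: descend L_disc evaluated at the lookahead parameters to update omega
    lp = torch.stack([token_logprobs(theta_hat if l == label else theta[l], ids) for l in range(L)])
    ldisc = -(lp[label] - torch.logsumexp(lp, dim=0)).mean()
    omega.data -= lr_omega * torch.autograd.grad(ldisc, omega)[0]
    # final inner step on L_w-gen with the updated token weights
    new_w = torch.sigmoid(E[ids] @ omega).detach()
    lw = -(new_w * token_logprobs(theta[label], ids)).mean()
    theta[label].data -= lr_theta * torch.autograd.grad(lw, theta[label])[0]

meta_step(ids=torch.tensor([3, 17, 42]), label=0)
```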

3.3. Classifier Fine-Tuning

With the trained generator G_θp, we can synthesize novel training samples D_gen that augment D_train for fine-tuning a classification PLM C_φ. The major challenge in effectively leveraging D_gen is that label noise (i.e., some generated samples may not accurately pertain to the corresponding label) may deteriorate model performance if standard supervised learning is directly used. We propose a simple noise-robust training procedure to improve the generalization and stability of training: first fine-tune C_φ on D_train with standard supervised training, and then continue fine-tuning it on D_gen by applying label smoothing (Szegedy et al., 2016) and temporal ensembling (Laine & Aila, 2017) as regularization, following (Meng et al., 2022a). Specifically, given a training sample (x̃, ỹ) ∈ D_gen, we minimize the following classification loss:

$$\mathcal{L}_{\text{class}}(\phi) = -\sum_{l=1}^{|L|} q_l \log p_\phi(\tilde{x})_l \;-\; \lambda \sum_{l=1}^{|L|} \bar{z}_l \log\frac{p_\phi(\tilde{x})_l}{\bar{z}_l}, \qquad (5)$$

where q is the smoothed target label distribution obtained by applying a label smoothing weight to ỹ; p_φ(x̃) is the model prediction on x̃; λ is a regularization weight for temporal ensembling; and z̄ is the accumulated moving-average model prediction. We also use the ensembled prediction z̄ to filter out noisy synthesized samples: we only include those samples for training where z̄ strongly agrees with the label ỹ (i.e., z̄_ỹ > δ where δ > 0 is a threshold parameter). In Eq. (5), the first classification term is the cross-entropy loss with smoothed labels; the second regularization term corresponds to temporal ensembling, which requires the current model prediction to be close to its past accumulated predictions. This not only neutralizes the fluctuation in model predictions for better training stability when label noise is present (Nguyen et al., 2020) but also helps prevent catastrophic forgetting (Kirkpatrick et al., 2017) of the information learned previously from the few-shot training set D_train. Please refer to Appendix B for details about the temporal ensembling implementation. The overall procedure of classifier fine-tuning is summarized in Algorithm 2.

Algorithm 2: Classifier Fine-Tuning on D_train and D_gen
  Input: D_train: few-shot training set; D_gen: synthesized training set.
  Parameter: T: number of training steps.
  Output: φ: trained classification model parameters.
  φ^(0) ← train on D_train with standard supervised learning
  z̄ ← 0  (initialize ensembled prediction)
  for t in [0, 1, ..., T − 1] do
    B ← sample a minibatch from D_gen
    φ^(t+1) ← take one gradient step to descend L_class in Eq. (5) on B
    Update D_gen to exclude noisy samples based on z̄
  end for
  return φ = φ^(T)
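Below is a small sketch of one plausible implementation of the classifier loss in Eq. (5) together with the z̄-based noise filtering; it is not the authors' code, and epsilon, lambda_reg, momentum and delta are illustrative values rather than the paper's tuned hyperparameters.

```python
# Sketch only: label-smoothed cross-entropy plus a temporal-ensembling regularizer (Eq. 5),
# with filtering of generated samples whose ensembled prediction disagrees with their label.
import torch
import torch.nn.functional as F

def class_loss(logits, y_tilde, z_bar, epsilon=0.1, lambda_reg=1.0):
    num_labels = logits.size(-1)
    p = F.softmax(logits, dim=-1)                                    # model prediction p_phi(x~)
    q = torch.full_like(p, epsilon / (num_labels - 1))               # smoothed target distribution q
    q.scatter_(-1, y_tilde.unsqueeze(-1), 1.0 - epsilon)
    ce = -(q * torch.log(p + 1e-8)).sum(-1)                          # cross-entropy with smoothed labels
    reg = -(z_bar * torch.log(p / (z_bar + 1e-8) + 1e-8)).sum(-1)    # keep p close to the accumulated prediction z_bar
    return (ce + lambda_reg * reg).mean()

def update_ensemble(z_bar, p, momentum=0.9):
    return momentum * z_bar + (1.0 - momentum) * p.detach()          # moving-average ensembled prediction

def keep_mask(z_bar, y_tilde, delta=0.7):
    return z_bar.gather(-1, y_tilde.unsqueeze(-1)).squeeze(-1) > delta  # keep samples where z_bar agrees with y~

# toy usage
B, L = 4, 3
logits = torch.randn(B, L, requires_grad=True)
y_tilde = torch.randint(L, (B,))
z_bar = torch.full((B, L), 1.0 / L)
loss = class_loss(logits, y_tilde, z_bar)
loss.backward()
z_bar = update_ensemble(z_bar, F.softmax(logits, dim=-1))
kept = keep_mask(z_bar, y_tilde, delta=0.3)
```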

4. Experimental Setup

Downstream Tasks and Metrics. We conduct evaluation on all tasks of the GLUE benchmark (Wang et al., 2018) (more details in Appendix D) except STS-B, which is a regression task. We follow the same data split and evaluation protocol as (Gao et al., 2021): both D_train and D_dev contain 16 samples per label and are sampled from the original training set with 5 different random seeds. The original development sets are used for testing. For all reported results, we include the average and standard deviation over the 5 different D_train/D_dev splits. F1 score is used as the metric for QQP and MRPC, Matthews correlation for CoLA, and accuracy for the remaining tasks.
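To make the split protocol concrete, here is a small illustrative helper (not taken from the paper's release) that draws K = 16 samples per label for D_train and another 16 per label for D_dev under several random seeds.

```python
# Sketch only: building 16-shot-per-label train/dev splits for several seeds.
import random
from collections import defaultdict

def few_shot_split(examples, k=16, seed=13):
    """examples: list of (text, label) pairs. Returns (train, dev), each with k samples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    train, dev = [], []
    for label in by_label:
        items = by_label[label]
        rng.shuffle(items)
        train.extend(items[:k])          # K samples per label for D_train
        dev.extend(items[k:2 * k])       # K more samples per label for D_dev
    return train, dev

toy_data = [(f"example sentence {i}", i % 2) for i in range(200)]     # placeholder data
splits = [few_shot_split(toy_data, k=16, seed=s) for s in range(5)]   # 5 different splits
```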

Models and Training Settings. FewGen is a training data generation method and can be used with any fine-tuning method on any classification model. We use moderate-sized PLMs to ensure our results are reproducible on typical research hardware: CTRL (1.6B parameters) (Keskar et al., 2019) as the generator G_θ and RoBERTa-Large (356M parameters) (Liu et al., 2019) as the classifier C_φ. We use prefix-tuning for training G_θ and prompt-based fine-tuning for training C_φ. For simplicity, we use the most basic manual prompt version of LM-BFF (Gao et al., 2021). The only exception is CoLA, for which we use standard fine-tuning since the input data might be out of the distribution of C_φ (Gao et al., 2021). The hyperparameter tuning is performed on D_dev. More details are in Appendix B.

Compared Methods. No-augmentation baselines include zero-shot prompting, standard fine-tuning, in-context learning, and the following strong few-shot learning methods: four versions of LM-BFF (Gao et al., 2021), P-Tuning (Liu et al., 2021b) and DART (Zhang et al., 2022). We also compare with data augmentation methods for few-shot learning: MixText (Chen et al., 2020), using back translation systems to generate paraphrases (UDA-style (Xie et al., 2020) augmentation), a few-shot demonstration method GPT3Mix (Yoo et al., 2021), and standard fine-tuning of the generator on the few-shot samples with prompts. For fair comparisons, all augmentation methods use LM-BFF (Man.) to fine-tune a RoBERTa-Large classifier. We also include the results of fully-supervised fine-tuning. More details about augmentation baselines are in Appendix E.

5. Evaluation

5.1. Main Results

We present the results of FewGen and baselines in Table 1. FewGen achieves overall better performance across the GLUE tasks, on average 5+ points higher than the previous best few-shot method without augmentation, and 3+ points better than GPT3Mix [2] (Yoo et al., 2021), which uses a 100 times larger generator model (175B) than FewGen.

Comparison with Back Translation. Using back translation to paraphrase the few-shot samples does not improve the results—this is probably because it does not produce samples that are sufficiently different from the few-shot training set. The success of UDA (Xie et al., 2020) is grounded in the augmentations from abundant unlabeled data that improve the classifier generalization. However, under the strict few-shot learning setup, there is no access to additional task-specific unlabeled data (Gao et al., 2021), making it challenging for paraphrase-based methods to create sufficiently diverse training samples based only on the small few-shot set. The new training samples produced by our FewGen method are not limited to paraphrases of the few-shot samples, as the generator is trained via prefix-tuning to preserve the PLM's pretraining knowledge, based on which novel training samples can be synthesized.

Comparison with GPT3Mix. The gigantic size of GPT3 makes it challenging to tune on few-shot samples. Therefore, GPT3Mix (Yoo et al., 2021) uses few-shot samples as demonstrations for creating the augmentations. Such an approach suffers from two limitations: (1) without any parameter update to the PLM, its learning ability is not fully leveraged to adapt to the few-shot training set; (2) the PLM can only use a small subset of the few-shot samples at a time for creating each augmentation, as the number of demonstrations received by the model is bounded by its maximum input sequence length. This makes the quality of the created augmentations more sensitive to the randomly drawn training samples. Our FewGen method, on the other hand, can use the entire few-shot set for tuning the PLM and achieves overall even better classification results with a much smaller PLM (< 1% the size of the GPT3 model), which can be deployed much more easily in practice.

[2] The original GPT3Mix paper uses accuracy as the metric instead of Matthews correlation for CoLA; our reimplemented GPT3Mix achieves 79.4 (0.6) on CoLA if measured by accuracy.

Table 1: Results on seven classification tasks of the GLUE benchmark. We report average and standard deviation performance over 5 different D_train/D_dev splits defined in (Gao et al., 2021). †: Results from (Gao et al., 2021). ‡: Results from (Zhang et al., 2022). Methods that use additional models apart from the final classification model are marked.

5.2. Ablation Studies

The overall performance gain brought by FewGen over a no-augmentation counterpart can be seen by comparing FewGen with LM-BFF (Man.), which uses the same classifier and fine-tuning method on D_train only. We further analyze the effectiveness of each important component in FewGen via the following ablations: (1) using the standard L_gen in Eq. (1) instead of our proposed L_w-gen in Eq. (3) for generator tuning (w. L_gen); (2) using the directly combined L_gen and L_disc for generator tuning (w. L_gen + L_disc); (3) without applying label smoothing in Eq. (5) (− label smooth); (4) without applying temporal ensembling in Eq. (5) (− temporal ensemble); (5) directly fine-tuning the classification model on the combination of D_gen and D_train (w. fine-tune on D_train ∪ D_gen) [3]. As shown in Table 2, (1) & (2) using the standard maximum likelihood loss or the combination of generative and discriminative losses to tune the generator both yield lower-quality training data and lead to degraded classification performance; (3) & (4) not applying regularization techniques for fine-tuning the classifier makes it more prone to label noise in the generated samples; (5) fine-tuning the classifier on the combination of D_gen and D_train significantly underperforms our two-step fine-tuning method.

[3] For this ablation, we upsample D_train by ×100 so that its size is comparable with D_gen; otherwise, the result is much worse.

Table 2: Ablation results of FewGen (average over the 5 splits, with standard deviation in parentheses).

Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | RTE | MRPC
FewGen | 75.7 (1.6) / 77.1 (1.0) | 71.5 (1.7) | 76.3 (4.4) | 93.1 (0.8) | 40.0 (7.5) | 71.2 (2.4) | 81.1 (2.5)
w. Lgen | 74.9 (1.0) / 76.2 (1.0) | 70.7 (1.9) | 75.0 (4.8) | 92.5 (0.7) | 37.8 (8.2) | 69.5 (2.2) | 80.8 (3.0)
w. Lgen + Ldisc | 74.6 (1.6) / 76.0 (1.5) | 68.8 (2.1) | 76.1 (4.3) | 92.4 (0.8) | 41.2 (9.0) | 70.1 (2.2) | 79.6 (2.4)
− label smooth | 75.0 (1.3) / 76.2 (1.0) | 71.1 (1.8) | 76.5 (3.5) | 92.7 (0.7) | 39.3 (8.6) | 69.4 (1.9) | 81.3 (2.8)
− temporal ensemble | 72.2 (2.5) / 74.0 (2.2) | 65.8 (2.1) | 75.1 (2.7) | 92.1 (1.7) | 33.9 (4.4) | 66.6 (2.4) | 80.4 (3.2)
w. fine-tune on Dtrain ∪ Dgen | 68.9 (1.8) / 70.6 (1.9) | 64.3 (1.5) | 71.1 (4.1) | 91.8 (1.3) | 34.0 (3.2) | 59.6 (1.0) | 80.4 (3.5)

Figure 3: With different generator tuning objectives: (a) L_disc during training; (b) dev set (language modeling) loss during training.

5.3. Analyses of Loss Functions for Generator Tuning

As shown in Table 2, the choice of generator loss has a significant impact on the synthesized data quality and thus the final model performance. We conduct further analyses to compare the training processes of the generator under the following three loss functions and the resulting generated samples: (1) L_gen, which is the standard language modeling loss; (2) L_gen + L_disc, which directly adds the discriminative loss to generator training; and (3) L_w-gen, which is our meta-weighted objective. Fig. 3 shows the discriminative loss L_disc and the standard language modeling loss on the held-out development set throughout training. Although using L_gen + L_disc helps reduce the discriminative loss, it comes at the cost of hindering language modeling—the generator loss on the development set is high. Using our meta-weighted objective L_w-gen not only encourages discriminativeness but also mitigates overfitting, yielding the lowest validation set loss. This is likely because the model receives contrastive information from other labels, which facilitates more accurate modeling of the texts with the target label.

Quantitative Analyses. Apart from the final classification model performance, which indirectly reflects the synthetic data quality, we additionally conduct more direct quantitative analyses of different generator training objectives. We use two metrics: (1) the accuracy of generated texts, which is judged by fully-supervised RoBERTa-Large models fine-tuned on the original training sets of each task. We choose to adopt such an automatic evaluation instead of human evaluation because it is efficient and reliable—fully-supervised RoBERTa-Large models have comparable or better accuracy than human baselines according to the GLUE benchmark [4]. (2) The generator's perplexity on the test sets, which reflects how well the generator models the task distribution. As shown in Table 3, using L_w-gen for generator training consistently outperforms using L_gen or L_gen + L_disc, both in generated text accuracy and in language modeling ability. Comparing L_w-gen with L_gen, the automatically learned meta weights emphasize discriminative tokens in generator training and help the generator capture subtle semantic differences across different labels, resulting in better language modeling quality and more distinctive synthetic data. Comparing L_w-gen with L_gen + L_disc, the generator training objective is not directly impacted by the discriminative objective, thus avoiding the gradient interference issue in multi-task learning (Standley et al., 2019)—the gradient for optimizing the generative probability p(x | y_l) would be interfered with by the gradient optimizing the discriminative probability p(y_l | x) if L_gen + L_disc were used. Therefore, using L_w-gen results in better language modeling quality and more fluent and coherent generation results.

[4] https://gluebenchmark.com/leaderboard

Table 3: Quantitative evaluation of generator training objectives. We use two metrics: generated data accuracy (Acc; higher is better) and the generator's perplexity on the test set (PPL; lower is better). The results are averaged over 5 D_train/D_dev splits; each cell shows Acc / PPL.

Objective | MNLI | QQP | QNLI | SST-2 | CoLA | RTE | MRPC
Lgen | 69.4 / 13.1 | 87.5 / 10.9 | 57.0 / 23.4 | 91.5 / 43.8 | 59.1 / 85.6 | 82.9 / 9.3 | 87.6 / 5.0
Lgen + Ldisc | 70.2 / 13.5 | 87.3 / 11.2 | 57.2 / 24.8 | 92.0 / 49.5 | 59.2 / 87.0 | 82.8 / 9.6 | 86.3 / 5.3
Lw-gen | 72.3 / 11.9 | 89.5 / 10.7 | 60.1 / 23.2 | 93.2 / 43.5 | 60.7 / 83.8 | 83.4 / 8.9 | 90.5 / 4.9

Qualitative Analyses. We showcase concrete generation results for the three labels of MNLI by models trained with the three different loss functions in Table 4. The model trained with L_gen produces fluent and coherent sentences, but the generated sentences do not accurately pertain to the desired label (i.e., the “entailment” and “contradiction” generation results are in fact neutral with respect to the given sentence), lacking label discriminativeness. When L_gen + L_disc is used, the generated samples of different labels are more distinctive, but also become less natural and coherent due to the model's language modeling ability being hampered. The generator tuned with L_w-gen produces both coherent and label-discriminative samples. More concrete generation results for each task can be found in Appendix F.

Table 4: (For MNLI) Examples of the generated second sequence (hypothesis) by generators tuned with three different objectives, conditioned on a given first sequence (premise): “In 2009, hair samples from 1,137 Taiji residents were tested for mercury by the National Institute for Minamata Disease (NIMD)”. The true label of each generated sequence is marked in parentheses at the end of the sequence.

Objective | Target Label | Generated Second Sequence (true label)
Lgen | entailment | The NIMD found that hair samples from 1,137 Taiji residents had mercury in their hair. (neutral)
Lgen | neutral | The NIMD found that there was no evidence of a link between exposure to high levels of mercury and thyroid cancer. (neutral)
Lgen | contradiction | There was no evidence of mercury in hair samples from Taiji. (neutral)
Lgen + Ldisc | entailment | The number of hairs in a sample is equal to the number of people who lived in Taiji. (neutral)
Lgen + Ldisc | neutral | The results showed that there was no significant difference in levels of mercury. (neutral)
Lgen + Ldisc | contradiction | Hair samples from 1,137 Taiji residents were not tested. (contradiction)
Lw-gen | entailment | The NIMD tested hair samples from 1,137 residents of Taiji. (entailment)
Lw-gen | neutral | There was no significant difference in levels between people who lived near a nickel mine and those living far away. (neutral)
Lw-gen | contradiction | The NIMD did not test any of the hair samples. (contradiction)

Figure 4: Visualization of learned token weights on two samples from MNLI's few-shot training set. The generator is trained given the first sentence to generate the second. The tokens associated with higher weights (e.g., “weak” in the first example and “hates” in the second) are more label indicative.
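For reference, a small sketch of the two automatic quality metrics reported in Table 3 (generated-sample accuracy judged by a fully trained classifier, and generator perplexity on held-out text); it is our own illustration, with `judge`, `generator` and `tok` as hypothetical stand-ins for a fine-tuned RoBERTa classifier and the tuned generator with its tokenizer.

```python
# Sketch only: generated-data accuracy and generator perplexity.
import math
import torch

def generated_accuracy(judge, samples):
    """samples: list of (text, intended_label); judge(text) -> predicted label id."""
    correct = sum(1 for text, label in samples if judge(text) == label)
    return correct / max(len(samples), 1)

@torch.no_grad()
def perplexity(generator, tok, texts):
    """Token-level perplexity of a Hugging Face causal LM on held-out texts."""
    total_nll, n_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = generator(input_ids=ids, labels=ids)       # .loss is the mean NLL over predicted tokens
        total_nll += out.loss.item() * (ids.size(1) - 1)
        n_tokens += ids.size(1) - 1
    return math.exp(total_nll / max(n_tokens, 1))
```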

5.4. Visualization of Learned Token Weights

To understand how token weights are automatically learned during generator tuning, we visualize the learned weights in Fig. 4. The tokens with higher weights (e.g., “weak” in the first example and “hates” in the second example) are learned to be important tokens that decide the relation of the second sentence to the first sentence (i.e., the label of the training sample). With such tokens emphasized during training, the generator is encouraged to capture label-discriminative information that facilitates the generation of unambiguous training samples.

6. Discussions and Conclusions

Ethical Considerations
Despite the impressive text generation and representation power of PLMs, they can also come with the risk (Bender et al., 2021; Bender & Koller, 2020; Brown et al., 2020) of generating disinformation (Pagnoni et al., 2021) or exacerbating biases (Prabhumoye et al., 2018). Instead of improving upon PLM architectures or generation techniques, our work focuses on using existing PLMs to create training data for NLU tasks. In practice, our method can be combined with any bias reduction and correction strategies (Gehman et al., 2020; Ma et al., 2020) to reduce the adverse effects of PLMs.
Limitations
Compared to few-shot learning methods that directly train classification models on the small training set, FewGen requires tuning a generator PLM and using it to synthesize novel training samples, resulting in higher computation costs and longer running time. Still, we believe that our method may bring more good than harm—when the small training data size becomes the performance bottleneck for NLU tasks, a simple yet costly solution is to obtain more human annotations. Our method may replace or reduce the human efforts in such training data creation processes.
Conclusions
In this work, we propose FewGen, which leverages few-shot training samples to tune a generator PLM for synthesizing novel training data. The generated data can then be used in combination with the few-shot samples to fine-tune a classification model for better generalization. To emphasize label-discriminative information during generator tuning, we propose a weighted maximum likelihood objective where the token weights are automatically learned via a discriminative meta objective. Since the generated samples may contain label noise, we propose a simple training procedure that first trains classifiers on the few-shot training set and then on the generated set by applying regularization for noise-robustness. Across seven classification tasks from the GLUE benchmark, FewGen significantly outperforms existing approaches under the same few-shot learning setting. The effectiveness of each important component in FewGen is validated via ablation studies. Future directions may include: using larger PLMs as the generator and the classifier, jointly training both models with each other's high-confidence predictions, improving the robustness of models trained on synthetic data, and developing systematic metrics to evaluate the quality of generated training samples.

References

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. Toward controlled generation of text. In ICML, 2017.
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. Large language models can self-improve. ArXiv, abs/2210.11610, 2022.
Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H. T., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A. F., Bogoychev, N., Martins, A. F. T., and Birch, A. Marian: Fast neural machine translation in C++. In ACL System Demo, 2018.
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. CTRL: A conditional transformer language model for controllable generation. ArXiv, abs/1909.05858, 2019.
Khalifa, M., ElSahar, H., and Dymetman, M. A distributional approach to controlled text generation. In ICLR, 2021.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
Krause, B., Gotmare, A. D., McCann, B., Keskar, N. S., Joty, S. R., Socher, R., and Rajani, N. GeDi: Generative discriminator guided sequence generation. In EMNLP, 2021.
Kumar, S., Malmi, E., Severyn, A., and Tsvetkov, Y. Controlled text generation as continuous optimization with multiple constraints. In NeurIPS, 2021.
Kumar, V., Choudhary, A., and Cho, E. Data augmentation using pre-trained transformer models. In Workshop on Life-long Learning for Spoken Language Systems, 2020.
Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
Lee, K., Guu, K., He, L., Dozat, T., and Chung, H. W. Neural data augmentation via example extrapolation. arXiv preprint arXiv:2102.01335, 2021.
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.
Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021.
Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. DExperts: Decoding-time controlled text generation with experts and anti-experts. In ACL, 2021a.
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS, 2022a.
Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., and Chen, W. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out, 2022b.
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. GPT understands, too. ArXiv, abs/2103.10385, 2021b.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Logan IV, R. L., Balažević, I., Wallace, E., Petroni, F., Singh, S., and Riedel, S. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353, 2021.
Ma, X., Sap, M., Rashkin, H., and Choi, Y. PowerTransformer: Unsupervised controllable revision for biased language correction. In EMNLP, 2020.
Meng, Y., Shen, J., Zhang, C., and Han, J. Weakly-supervised neural text classification. In CIKM, 2018.
Meng, Y., Shen, J., Zhang, C., and Han, J. Weakly-supervised hierarchical text classification. In AAAI, 2019.
Meng, Y., Xiong, C., Bajaj, P., Tiwary, S., Bennett, P., Han, J., and Song, X. COCO-LM: Correcting and contrasting text sequences for language model pretraining. In NeurIPS, 2021a.
Meng, Y., Zhang, Y., Huang, J., Wang, X., Zhang, Y., Ji, H., and Han, J. Distantly-supervised named entity recognition with noise-robust learning and language model augmented self-training. In EMNLP, 2021b.
Meng, Y., Huang, J., Zhang, Y., and Han, J. Generating training data with language models: Towards zero-shot language understanding. In NeurIPS, 2022a.
Meng, Y., Xiong, C., Bajaj, P., Tiwary, S., Bennett, P., Han, J., and Song, X. Pretraining text encoders with adversarial mixture of training signal generators. In ICLR, 2022b.
Min, S., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Noisy channel language model prompting for few-shot text classification. In ACL, 2022a.
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In EMNLP, 2022b.
Miyato, T., Dai, A. M., and Goodfellow, I. J. Adversarial training methods for semi-supervised text classification. In ICLR, 2017.
Nguyen, D. T., Mummadi, C. K., Ngo, T.-P.-N., Nguyen, T. H. P., Beggel, L., and Brox, T. SELF: Learning to filter noisy labels with self-ensembling. In ICLR, 2020.
Pagnoni, A., Balachandran, V., and Tsvetkov, Y. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In NAACL, 2021.
Pascual, D., Egressy, B., Meister, C., Cotterell, R., and Wattenhofer, R. A plug-and-play method for controlled text generation. In EMNLP Findings, 2021.
Perez, E., Kiela, D., and Cho, K. True few-shot learning with language models. In NeurIPS, 2021.
Prabhumoye, S., Tsvetkov, Y., Salakhutdinov, R., and Black, A. W. Style transfer through back-translation. In ACL, 2018.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2019.
Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In ICML, 2018.
Scao, T. L. and Rush, A. M. How many data points is a prompt worth? In NAACL, 2021.
Schick, T. and Schütze, H. Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL, 2021a.
Schick, T. and Schütze, H. Few-shot text generation with natural language instructions. In EMNLP, 2021b.
Schick, T. and Schütze, H. Generating datasets with pretrained language models. In EMNLP, 2021c.
Schick, T. and Schütze, H. It's not just size that matters: Small language models are also few-shot learners. In NAACL, 2021d.
Shankar, I., Nikhil, D., and Kornél, C. First Quora dataset release: Question pairs, 2017. URL https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs.
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. Eliciting knowledge from language models using automatically generated prompts. In EMNLP, 2020.
Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. Meta-Weight-Net: Learning an explicit mapping for sample weighting. In NeurIPS, 2019.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
Standley, T. S., Zamir, A. R., Chen, D., Guibas, L. J., Malik, J., and Savarese, S. Which tasks should be learned together in multi-task learning? In ICML, 2019.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In CVPR, 2016.
Tam, D., Menon, R. R., Bansal, M., Srivastava, S., and Raffel, C. Improving and simplifying pattern exploiting training. In EMNLP, 2021.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP Workshop BlackboxNLP, 2018.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.
Wang, Y.-X., Ramanan, D., and Hebert, M. Learning to model the tail. In NIPS, 2017.
Wang, Z., Yu, A. W., Firat, O., and Cao, Y. Towards zero-label language learning. ArXiv, abs/2109.09193, 2021.
Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. In TACL, 2019.
Wei, J. and Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In EMNLP, 2019.
Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
Wu, L., Tian, F., Xia, Y., Fan, Y., Qin, T., Lai, J., and Liu, T.-Y. Learning to teach with dynamic loss functions. In NeurIPS, 2018.
Xie, Q., Dai, Z., Hovy, E. H., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training. In NeurIPS, 2020.
Yang, K. and Klein, D. FUDGE: Controlled text generation with future discriminators. In NAACL, 2021.
Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Bras, R. L., Wang, J.-P., Bhagavatula, C., Choi, Y., and Downey, D. G-DAug: Generative data augmentation for commonsense reasoning. In EMNLP Findings, 2020.
Ye, J., Gao, J., Li, Q., Xu, H., Feng, J., Wu, Z., Yu, T., and Kong, L. ZeroGen: Efficient zero-shot learning via dataset generation. In EMNLP, 2022.
Yoo, K. M., Park, D.-H., Kang, J., Lee, S.-W., and Park, W. GPT3Mix: Leveraging large-scale language models for text augmentation. In EMNLP Findings, 2021.
Zhang, N., Li, L., Chen, X., Deng, S., Bi, Z., Tan, C., Huang, F., and Chen, H. Differentiable prompt makes pre-trained language models better few-shot learners. In ICLR, 2022.
Zhao, T., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. In ICML, 2021.
Zhong, Z., Friedman, D., and Chen, D. Factual probing is [MASK]: Learning vs. learning to recall. In NAACL, 2021.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. ArXiv, abs/1909.08593, 2019.


Author(s): Yu Zhang, Jiaxin Huang, Jiawei Han, Yu Meng, Martin Michalski, Tarek Abdelzaher
Title: Tuning Language Models As Training Data Generators for Augmentation-Enhanced Few-Shot Learning
Year: 2023
DOI: 10.48550/arXiv.2211.03044