2022 PTuningPromptTuningCanBeCompara

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Abstract

Prompt tuning has become a new paradigm for model tuning and it has demonstrated success in natural language pretraining and even vision pretraining. In this work, we explore the transfer of prompt tuning to multimodal pretraining, with a focus on generative multimodal pretrained models, instead of contrastive ones. Specifically, we implement prompt tuning on the unified sequence-to-sequence pretrained model adaptive to both understanding and generation tasks. Experimental results demonstrate that the light-weight prompt tuning can achieve comparable performance with finetuning and surpass other light-weight tuning methods. Besides, in comparison with finetuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks. We further figure out that experimental factors, including the prompt length, prompt depth, and reparameterization, have great impacts on the model performance, and thus we empirically provide a recommendation for the setups of prompt tuning. Despite the observed advantages, we still find some limitations in prompt tuning, and we correspondingly point out the directions for future studies. Codes are available at https://github.com/OFA-Sys/OFA

1 Introduction

Recent years have witnessed the great success of large-scale pretraining based on large models and big data in natural language processing (NLP) (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020) and computer vision (Chen et al., 2020b,a,c; Chen and He, 2021; Bao et al., 2021; He et al., 2021b). Mirroring the success of BERT-like models (Devlin et al., 2019), researchers have found that pretraining can level up the downstream performance of cross-modal representation learning algorithms by a large margin (Chen et al., 2020d; Lu et al., 2019; Su et al., 2020; Tan and Bansal, 2019). Recent advances show that this idea is compatible with sequence-to-sequence (Seq2Seq) learning, and the Seq2Seq-based multimodal pretrained model can adapt to both understanding and generation tasks, and even achieve state-of-the-art performance in a series of downstream tasks (Cho et al., 2021; Wang et al., 2021, 2022).

Despite the great success of large-scale pretrained models across multiple domains, training such models requires a large amount of computation. Conventional finetuning, though effective in attaining high performance, suffers from low training efficiency, especially when the pretrained model is large in scale. Brown et al. (2020) introduced the idea of prompting to encourage the model to generate the correct answer with a manual prompt of task instructions or a demonstration of several task examples, without further training to tune the model parameters. This is often regarded as “in-context learning”, as the model generates responses based on the given context. It helps large-scale pretrained language models achieve unprecedented performance in few-shot and zero-shot learning (Brown et al., 2020; Chowdhery et al., 2022; Sanh et al., 2021; Wei et al., 2021). Inspired by this idea, researchers have moved forward to a new paradigm called prompt tuning (Li and Liang, 2021; Liu et al., 2021c; Lester et al., 2021; Liu et al., 2021a). In comparison with finetuning, prompt tuning only tunes a trivial fraction of the pretrained model's parameters (e.g., 1%): it freezes most parameters of the pretrained model and only tunes several prompt embeddings, as well as the output layer if necessary. Recent advances have shown that prompt tuning can help pretrained models achieve performance comparable with finetuning across different NLP downstream tasks, including natural language understanding and generation (Liu et al., 2021b; He et al., 2021a). Such significant achievements have attracted the attention of the research community around large pretrained models.
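To make the parameter split behind prompt tuning concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the backbone is frozen and only the soft prompt embeddings (plus an optional output head) receive gradients. The names `backbone` and `head` are hypothetical placeholder modules.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Minimal sketch of prompt tuning: the pretrained backbone is frozen and
    only a small set of soft prompt embeddings (and, optionally, the output
    head) is trained. `backbone` and `head` are hypothetical modules."""

    def __init__(self, backbone: nn.Module, head: nn.Module,
                 prompt_length: int = 64, embed_dim: int = 768):
        super().__init__()
        self.backbone = backbone
        self.head = head
        # Trainable soft prompt embeddings, prepended to the input sequence.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)
        # Freeze every backbone parameter; only the prompt (and head) are tuned.
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        hidden = self.backbone(torch.cat([prompts, input_embeds], dim=1))
        return self.head(hidden)
```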

In domains other than NLP, recent studies have also demonstrated the effectiveness of prompt tuning. Jia et al. (2022) demonstrated that visual prompt tuning could surpass finetuning across a series of tasks, with significant advantages in training efficiency. In cross-modal representation learning, research on prompt tuning mainly focuses on CLIP-like models (Radford et al., 2021). CLIP is a contrastive-learning-based multimodal pretrained model, pretrained on large-scale image-text pairs. CLIP is able to achieve outstanding performance in zero-shot image classification by turning labels into textual prompts with manual prompt templates. To enhance the performance, Radford et al. (2021) proposed prompt ensembling by handcrafting a number of prompt templates. However, as creating hard prompts is tedious, researchers turned to the application of soft prompts for CLIP (Rao et al., 2021; Zhou et al., 2021, 2022) or the incorporation of adapters (Gao et al., 2021; Zhang et al., 2021). Besides the implementation on CLIP-like models, another line of work applies image prompts to pretrained language models for multimodal representation learning (Yao et al., 2021b; Tsimpoukelli et al., 2021). Though the large-scale pretrained language model is frozen in the process of downstream transfer, it can adapt to the few-shot learning scenarios of multimodal downstream tasks. Be that as it may, prompt tuning for the popular generative multimodal pretrained models, including BERT-like models and encoder-decoder pretrained models for cross-modal representation learning, remains unexplored. Yao et al. (2022) matched the tuning paradigm to the pretraining one with manual prompts, yet it is still unknown whether light-weight prompt tuning can also be effective for generative multimodal pretrained models.
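As an illustration of the prompt-ensembling idea mentioned above, the sketch below averages the text embeddings of several handcrafted templates per class and classifies an image by cosine similarity. It is a hedged sketch only: `encode_image`, `encode_text`, and the template strings are assumed placeholders, not the actual CLIP interface or template set.

```python
import torch

# Illustrative templates; Radford et al. (2021) use a different, larger set.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def zero_shot_weights(class_names, encode_text):
    """Build one classifier weight per class by averaging the normalized
    embeddings of several prompt templates (prompt ensembling)."""
    weights = []
    for name in class_names:
        texts = [t.format(name) for t in TEMPLATES]
        emb = encode_text(texts)                      # (n_templates, dim)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    return torch.stack(weights, dim=1)                # (dim, n_classes)

def classify(image, class_weights, encode_image):
    """Zero-shot classification by cosine similarity to the ensembled weights."""
    img = encode_image(image)                         # (batch, dim)
    img = img / img.norm(dim=-1, keepdim=True)
    return (100.0 * img @ class_weights).softmax(dim=-1)
```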

This work fills this void and takes the lead in exploring prompt tuning for generative multimodal pretrained models. The objective of this study is to investigate whether prompt tuning is effective for the downstream transfer of generative multimodal pretrained models, and how it benefits large pretrained models in comparison with conventional finetuning. To be specific, we implement the simple but effective prefix tuning, one of the most popular prompt tuning methods, on the generative multimodal pretrained model. Prefix tuning has the advantage of simplicity and at the same time achieves remarkable performance in both natural language understanding and generation (Li and Liang, 2021; Liu et al., 2021b). In comparison with finetuning, the number of tunable parameters for prompt tuning is much smaller (~1%), leading to lower computation costs, e.g., memory.
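To make the "~1% of parameters" comparison tangible, a simple, illustrative helper for measuring the trainable fraction of a model might look as follows (not part of the paper's code):

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients. With prompt/prefix
    tuning this is typically on the order of 1%, versus 100% for
    conventional finetuning."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / max(total, 1)
```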

Through extensive experiments, we observe that light-weight prompt tuning helps the pretrained model achieve performance comparable with finetuning across 4 multimodal downstream tasks, spanning from understanding to generation. To analyze the difference between finetuning and prompt tuning, we follow the assumption that prompt tuning, with most parameters in the pretrained model frozen, should induce model robustness. We experiment on the tuning methods with adversarial attacks and observe phenomena consistent with this hypothesis. To take a step further, this study delves into the implementation details and investigates whether experimental factors like the prompt length, prompt depth, and reparameterization could saliently influence the final downstream performance. We find that in general a longer prompt (longer than 20 tokens) is a preferable choice, and our experiments show that a length of 64 should be favored in most cases, as an even longer prompt sequence will not only increase the computation costs but also incur performance degradation. Also, we show that reparameterization with additional trainable parameters cannot introduce significant improvements in downstream performance. Finally, we reflect on the method and illustrate its drawbacks in computation costs and training instability, and correspondingly, we point out some directions for future work.
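The reparameterization variant discussed above can be sketched as a small trainable MLP that produces the prompt from an underlying parameter matrix. The module below is illustrative only, assuming a prompt length of 64 in line with the recommendation above; the hidden size and activation are arbitrary choices, not the paper's setup.

```python
import torch
import torch.nn as nn

class ReparameterizedPrompt(nn.Module):
    """Sketch of prompt reparameterization: instead of optimizing the prompt
    embeddings directly, a trainable MLP maps an underlying parameter matrix
    to the prompt that is actually fed into the model."""

    def __init__(self, prompt_length: int = 64, embed_dim: int = 768,
                 hidden_dim: int = 512):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self) -> torch.Tensor:
        # Returns the (prompt_length, embed_dim) prompt used at tuning time.
        return self.mlp(self.raw)
```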

In the following sections, we briefly review the related work, deliver an introduction to prompt tuning for generative multimodal pretrained models, and report the experimental results and analysis. Lastly, we discuss the problems of prompt tuning in this scenario, point out the future work, and finally conclude this work.

2 Related Work

In this section, we review multimodal pretraining as well as prompt tuning. We first review the studies in the two main lines of multimodal pretraining, namely generative pretraining and contrastive pretraining, and we then review research on prompt-based learning in both NLP and cross-modal representation learning.

2.1 Multimodal Pretraining

The rise of vision & language pretraining started from the transfer of BERT (Devlin et al., 2019) to cross-modal representation learning. A series of studies (Lu et al., 2019; Su et al., 2020; Tan and Bansal, 2019; Chen et al., 2020d; Li et al., 2019) introduced BERT to multimodal pretraining. The encoder-decoder framework for multimodal pretraining has recently attracted attention, as a number of encoder-decoder models achieved state-of-the-art performance in cross-modal understanding and generation tasks (Wang et al., 2021, 2022; Yu et al., 2022). Besides, such a framework allows the unification of tasks into a sequence-to-sequence learning format and thus allows multitask pretraining with manual prompts (Cho et al., 2021; Wang et al., 2022). This leads to our motivation that prompt tuning should be a perfect combination with the recent unified multimodal pretrained models, and it can unleash the power of pretrained models with much lower computation costs than conventional finetuning.

Another trend in multimodal pretraining is contrastive learning. The most typical contrastive pretrained model is CLIP (Radford et al., 2021). It uses a Vision Transformer (ViT) (Dosovitskiy et al., 2021) or ResNet (He et al., 2016; Tan and Le, 2019) as the image encoder and a transformer model as the text encoder, and trains the two encoders jointly with a contrastive loss (van den Oord et al., 2018). Note that this model is pretrained on extremely large-scale data of image-text pairs. Following CLIP, a series of studies demonstrated the success of this route of contrastive-learning-based pretraining on large-scale data (Jia et al., 2021; Yao et al., 2021a). CLIP can achieve remarkable performance in cross-modal retrieval. What makes it really attractive is its strong performance in zero-shot classification with prompt ensembling, i.e., ensembling the outputs of the model with a handful of handcrafted prompts as the inputs. This started the research of prompts in multimodal pretraining.
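For reference, the symmetric image-text contrastive objective used by CLIP-style models can be sketched as below; this is an illustrative InfoNCE-style loss over a batch of paired embeddings, not the exact training code of CLIP.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of image-text pairs: matching
    pairs lie on the diagonal of the similarity matrix and are contrasted
    against all other pairs in the batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```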

2.2 Prompt-based Learning

Brown et al. (2020) illustrated that large-scale pretrained models can learn from the context and perform few-shot and zero-shot learning with prompts consisting of task instructions or a few task examples. Instead of using handcrafted hard prompts, Li and Liang (2021) demonstrated that only tuning soft prompt embeddings at each layer is sufficient for the pretrained model to achieve competitive performance in natural language generation, and later a number of studies showed that prompt tuning can be essentially effective for low-resource scenarios (Liu et al., 2021c; Gu et al., 2022; Sun et al., 2022b) and can even achieve performance comparable with finetuning (Lester et al., 2021; Liu et al., 2021b). Following this trend, a series of modifications to prompts and adapters (Hu et al., 2022; He et al., 2021a; Jiang et al., 2022; Sun et al., 2022a) aimed at improving performance or training efficiency have emerged and made prompt tuning a heated topic across the NLP community.
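A deep, per-layer prompt of the kind used in prefix tuning can be sketched as a learned key/value prefix for every transformer layer, with the backbone itself kept frozen. The sketch below assumes the common past-key-value tensor layout; the exact interface of any particular library (or of the authors' code) may differ.

```python
import torch
import torch.nn as nn

class LayerwisePrefix(nn.Module):
    """Sketch of deep/prefix prompting: a separate (key, value) prefix is
    learned for every transformer layer and every attention head, to be
    prepended to the attention keys and values of the frozen backbone."""

    def __init__(self, num_layers: int = 12, num_heads: int = 12,
                 prefix_length: int = 64, head_dim: int = 64):
        super().__init__()
        # Shape: (num_layers, 2, num_heads, prefix_length, head_dim),
        # where the size-2 axis holds the key and value prefixes.
        self.prefix = nn.Parameter(
            torch.randn(num_layers, 2, num_heads, prefix_length, head_dim) * 0.02)

    def forward(self, batch_size: int):
        # Expand to the batch and split into per-layer (key, value) pairs,
        # each of shape (batch, num_heads, prefix_length, head_dim).
        expanded = self.prefix.unsqueeze(1).expand(-1, batch_size, -1, -1, -1, -1)
        return [(layer[:, 0], layer[:, 1]) for layer in expanded]
```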

Recent prompt tuning methods for multimodal pretrained models mostly serve CLIP-like models (Zhou et al., 2021, 2022; Rao et al., 2021). Similarly, researchers tried to incorporate adapters into CLIP and also achieved satisfactory performance (Gao et al., 2021; Zhang et al., 2021). Besides prompt tuning for CLIP-like models, another line of work explored visual prompts for frozen language models. Tsimpoukelli et al. (2021) showed that when there is a powerful large pretrained language model, a visual encoder for prompt tuning is sufficient for multimodal few-shot learning. Taking a step forward, Alayrac et al. (2022) proposed Flamingo, a colossal multimodal model that enables in-context learning. It achieves state-of-the-art performance in a series of cross-modal downstream tasks in both few-shot and full-shot learning scenarios. Such tremendous success indicates the strong potential of prompt tuning in multimodal pretraining. In this work, we focus on an unexplored topic: prompt tuning for generative multimodal pretrained models.

3 Method

...

References

  • (Alayrac et al., 2022) ⇒ Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. (2022). "Flamingo: A Visual Language Model for Few-Shot Learning.” In: CoRR, abs/2204.14198.
  • (Anderson et al., 2016) ⇒ Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. (2016). "SPICE: Semantic Propositional Image Caption Evaluation.” In: ECCV 2016, Lecture Notes in Computer Science, Volume 9909, pages 382–398. Springer.
  • (Antol et al., 2015) ⇒ Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. (2015). "VQA: Visual Question Answering.” In: ICCV 2015, pages 2425–2433. IEEE Computer Society.
  • (Bao et al., 2021) ⇒ Hangbo Bao, Li Dong, and Furu Wei. (2021). "Beit: Bert Pre-Training of Image Transformers.” In: arXiv preprint arXiv:2106.08254.
  • (Brown et al., 2020) ⇒ Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. (2020). "Language Models are Few-Shot Learners.” In: NeurIPS 2020.
  • (Chen et al., 2020a) ⇒ Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. (2020a). "Big Self-Supervised Models are Strong Semi-Supervised Learners.” In: NeurIPS 2020, pages 10466–10478.
  • (Chen et al., 2020b) ⇒ Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. (2020b). "A Simple Framework for Contrastive Learning of Visual Representations.” In: ICML 2020, pages 1597–1607. PMLR.
  • (Chen et al., 2020c) ⇒ Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. (2020c). "Improved Baselines with Momentum Contrastive Learning.” In: arXiv preprint arXiv:2003.04297.
  • (Chen et al., 2015) ⇒ Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. (2015). "Microsoft COCO Captions: Data Collection and Evaluation Server.” In: CoRR, abs/1504.00325.
  • (Chen & He, 2021) ⇒ Xinlei Chen and Kaiming He. (2021). "Exploring Simple Siamese Representation Learning.” In: CVPR 2021, pages 15750–15758.
  • (Chen et al., 2020d) ⇒ Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. (2020d). "UNITER: Universal Image-Text Representation Learning.” In: ECCV 2020, Lecture Notes in Computer Science, Volume 12375, pages 104–120. Springer.
  • (Cho et al., 2021) ⇒ Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. (2021). "Unifying Vision-and-Language Tasks via Text Generation.” In: ICML 2021, Proceedings of Machine Learning Research, Volume 139, pages 1931–1942. PMLR.
  • (Chowdhery et al., 2022) ⇒ Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. (2022). "Palm: Scaling Language Modeling with Pathways.” In: CoRR, abs/2204.02311.
  • (Devlin et al., 2019) ⇒ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.
  • (Dong et al., 2017) ⇒ Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. (2017). "Boosting Adversarial Attacks with Momentum.” In: CoRR, abs/1710.06081.
  • (Dosovitskiy et al., 2021) ⇒ Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In: ICLR 2021. OpenReview.net.
  • (Gao et al., 2021) ⇒ Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. (2021). "CLIP-Adapter: Better Vision-Language Models with Feature Adapters.” In: CoRR, abs/2110.04544.
  • (Goodfellow et al., 2014) ⇒ Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. (2014). "Explaining and Harnessing Adversarial Examples.” In: CoRR, abs/1412.6572.
  • (Goyal et al., 2017) ⇒ Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. (2017). "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering.” In: CVPR 2017, pages 6325–6334. IEEE Computer Society.
  • (Gu et al., 2022) ⇒ Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. (2022). "PPT: Pre-trained Prompt Tuning for Few-Shot Learning.” In: ACL 2022, pages 8410–8423. Association for Computational Linguistics.
  • (He et al., 2021a) ⇒ Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. (2021a). "Towards a Unified View of Parameter-Efficient Transfer Learning.” In: CoRR, abs/2110.04366.
  • (He et al., 2021b) ⇒ Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. (2021b). "Masked Autoencoders Are Scalable Vision Learners.” In: arXiv preprint arXiv:2111.06377.
  • (He et al., 2016) ⇒ Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. (2016). "Deep Residual Learning for Image Recognition.” In: CVPR 2016, pages 770–778.
  • (Houlsby et al., 2019) ⇒ Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. (2019). "Parameter-Efficient Transfer Learning for NLP.” In: ICML 2019, Proceedings of Machine Learning Research, Volume 97, pages 2790–2799. PMLR.
  • (Hu et al., 2022) ⇒ Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Jingang Wang, Juanzi Li, Wei Wu, and Maosong Sun. (2022). "Knowledgeable Prompt-Tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification.” In: ACL 2022, pages 2225–2240. Association for Computational Linguistics.
  • (Jia et al., 2021) ⇒ Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. (2021). "Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision.” In: arXiv preprint arXiv:2102.05918.
  • (Jia et al., 2022) ⇒ Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. (2022). "Visual Prompt Tuning.” In: CoRR, abs/2203.12119.
  • (Jiang et al., 2022) ⇒ Yuezihan Jiang, Hao Yang, Junyang Lin, Hanyu Zhao, An Yang, Chang Zhou, Hongxia Yang, Zhi Yang, and Bin Cui. (2022). "Instance-wise Prompt Tuning for Pretrained Language Models.” In: CoRR, abs/2206.01958.
  • (Lavie & Agarwal, 2007) ⇒ Alon Lavie and Abhaya Agarwal. (2007). "METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments.” In: WMT@ACL 2007, pages 228–231. Association for Computational Linguistics.
  • (Lester et al., 2021) ⇒ Brian Lester, Rami Al-Rfou, and Noah Constant. (2021). "The Power of Scale for Parameter-Efficient Prompt Tuning.” In: EMNLP 2021, pages 3045–3059. Association for Computational Linguistics.
  • (Li et al., 2019) ⇒ Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. (2019). "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training.” In: CoRR, abs/1908.06066.
  • (Li & Liang, 2021) ⇒ Xiang Lisa Li and Percy Liang. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation.” In: ACL/IJCNLP 2021, pages 4582–4597. Association for Computational Linguistics.
  • (Lin et al., 2019) ⇒ Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. (2019). "Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks.” In: CoRR, abs/1908.06281.
  • (Liu, Yuan et al., 2021) ⇒ Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. (2021a). "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” In: CoRR, abs/2107.13586.
  • (Liu, Ji et al., 2021a) ⇒ Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. (2021b). "P-tuning v2: Prompt Tuning Can be Comparable to Fine-Tuning Universally Across Scales and Tasks.” In: CoRR, abs/2110.07602.
  • (Liu, Zheng et al., 2021) ⇒ Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. (2021c). "GPT Understands, Too.” In: CoRR, abs/2103.10385.
  • (Liu et al., 2019) ⇒ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach.” In: CoRR, abs/1907.11692.
  • (Lu et al., 2019) ⇒ Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. (2019). "VilBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” In: NeurIPS 2019, pages 13–23.
  • (Madry et al., 2017) ⇒ Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. (2017). "Towards Deep Learning Models Resistant to Adversarial Attacks.” In: CoRR, abs/1706.06083.
  • (Mao et al., 2016) ⇒ Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. (2016). "Generation and Comprehension of Unambiguous Object Descriptions.” In: CVPR 2016, pages 11–20. IEEE Computer Society.
  • (Papineni et al., 2002) ⇒ Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation.” In: ACL 2002, pages 311–318.
  • (Radford et al., 2021) ⇒ Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. (2021). "Learning Transferable Visual Models from Natural Language Supervision.” In: ICML 2021, Proceedings of Machine Learning Research, Volume 139, pages 8748–8763. PMLR.
  • (Radford et al., 2018) ⇒ Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. (2018). "Improving Language Understanding by Generative Pre-Training.”
  • (Raffel et al., 2020) ⇒ Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” In: Journal of Machine Learning Research, 21(140):1–140:67.
  • (Rao et al., 2021) ⇒ Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. (2021). "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting.” In: CoRR, abs/2112.01518.
  • (Sanh et al., 2021) ⇒ Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. (2021). "Multitask Prompted Training Enables Zero-Shot Task Generalization.” In: CoRR, abs/2110.08207.
  • (Su et al., 2020) ⇒ Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. (2020). "VL-BERT: Pre-Training of Generic Visual-Linguistic Representations.” In: ICLR 2020. OpenReview.net.
  • (Sun et al., 2022a) ⇒ Tianxiang Sun, Zhengfu He, Hong Qian, Xuanjing Huang, and Xipeng Qiu. (2022a). "BBTv2: Pure Black-Box Optimization Can Be Comparable to Gradient Descent for Few-Shot Learning.” In: CoRR, abs/2205.11200.
  • (Sun et al., 2022b) ⇒ Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. (2022b). "Black-Box Tuning for Language-Model-as-a-Service.” In: ICML 2022, Proceedings of Machine Learning Research, Volume 162, pages 20841–20855. PMLR.
  • (Tan & Bansal, 2019) ⇒ Hao Tan and Mohit Bansal. (2019). "LXMERT: Learning Cross-Modality Encoder Representations from Transformers.” In: EMNLP-IJCNLP 2019, pages 5099–5110. Association for Computational Linguistics.
  • (Tan & Le, 2019) ⇒ Mingxing Tan and Quoc V. Le. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” In: ICML 2019, Proceedings of Machine Learning Research, Volume 97, pages 6105–6114. PMLR.
  • (Tsimpoukelli et al., 2021) ⇒ Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. (2021). "Multimodal Few-Shot Learning with Frozen Language Models.” In: NeurIPS 2021, pages 200–212.
  • (Van den Oord et al., 2018) ⇒ Aäron van den Oord, Yazhe Li, and Oriol Vinyals. (2018). "Representation Learning with Contrastive Predictive Coding.” In: CoRR, abs/1807.03748.
  • (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. (2017). "Attention Is All You Need.” In: NeurIPS 2017, pages 5998–6008.
  • (Vedantam et al., 2015) ⇒ Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. (2015). "CIDEr: Consensus-Based Image Description Evaluation.” In: CVPR 2015, pages 4566–4575. IEEE Computer Society.
  • (Wang et al., 2022) ⇒ Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. (2022). "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.” In: CoRR, abs/2202.03052.
  • (Wang et al., 2021) ⇒ Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. (2021). "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision.” In: CoRR, abs/2108.10904.
  • (Wei et al., 2021) ⇒ Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. (2021). "Finetuned Language Models Are Zero-Shot Learners.” In: CoRR, abs/2109.01652.
  • (Xie et al., 2019) ⇒ Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. (2019). "Visual Entailment: A Novel Task for Fine-Grained Image Understanding.” In: CoRR, abs/1901.06706.
  • (Yang et al., 2019) ⇒ Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding.” In: NeurIPS 2019, pages 5754–5764.
  • (Yao et al., 2021a) ⇒ Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. (2021a). "FILIP: Fine-Grained Interactive Language-Image Pre-Training.” In: CoRR, abs/2111.07783.
  • (Yao et al., 2022) ⇒ Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. (2022). "PEVL: Position-Enhanced Pre-Training and Prompt Tuning for Vision-Language Models.” In: CoRR, abs/2205.11169.
  • (Yao et al., 2021b) ⇒ Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. (2021b). "CPT: Colorful Prompt Tuning for Pre-Trained Vision-Language Models.” In: CoRR, abs/2109.11797.
  • (Yu et al., 2022) ⇒ Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. (2022). "CoCa: Contrastive Captioners Are Image-Text Foundation Models.” In: CoRR, abs/2205.01917.
  • (Yu et al., 2016) ⇒ Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. (2016). "Modeling Context in Referring Expressions.” In: ECCV 2016, Lecture Notes in Computer Science, Volume 9906, pages 69–85. Springer.
  • (Zaken et al., 2022) ⇒ Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. (2022). "BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models.” In: ACL 2022, pages 1–9. Association for Computational Linguistics.
  • (Zhang et al., 2021) ⇒ Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. (2021). "TIP-Adapter: Training-Free CLIP-Adapter for Better Vision-Language Modeling.” In: CoRR, abs/2111.03930.
  • (Zhou et al., 2021) ⇒ Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. (2021). "Learning to Prompt for Vision-Language Models.” In: CoRR, abs/2109.01134.
  • (Zhou et al., 2022) ⇒ Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. (2022). "Conditional Prompt Learning for Vision-Language Models.” In: CoRR, abs/2203.05557.


Page: 2022 PTuningPromptTuningCanBeCompara
Author(s): Jie Tang, Zhilin Yang, Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, and Zhengxiao Du
Title: P-tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks
DOI: 10.18653/v1/2022.acl-short.8
Year: 2022