Pretrained Large Language Model (LLM)
A [[Pretrained Large Language Model (LLM)]] is a [[pretrained language model]] that is a [[large language model]].
* <B>Context:</B>
** It can be an input to an [[In-Context Learning System]] (see the usage sketch below the definition).
** It can be an input to an [[LLM Fine-Tuning System]].
** ...
** It can range from being a [[Pure Pretrained LLM]] to being a [[Finetuned LLM]] (such as an [[instruction-tuned LLM]]).
** ...
* <B>Example(s):</B>
** a [[General Purpose Pretrained LLM]], such as [[GPT-4]].
** a [[Domain-Specific Pretrained LLM]], such as:
*** a [[Pretrained Biomedical LLM]] (e.g. [[BioGPT]]) or a [[Pretrained Protein LLM]].
*** a [[Pretrained Software LLM]], such as [[Codex LLM]].
*** a [[Pretrained Finance LLM]], such as [[Bloomberg LLM]].
*** a [[Pretrained Legal LLM]].
** a [[Proprietary Pretrained LLM]], such as:
*** a [[Google Pretrained LLM]], [[Azure Pretrained LLM]], ...
** a [[Base LLM]], such as [[llama31-405b-base-bf-16]].
** …
* <B>Counter-Example(s):</B>
** a [[Pre-Trained Small Language Model]].
** a [[Pre-Trained Image Generation Model]].
* <B>See:</B> [[Language Model Metamodel]], [[LLM Architecture]], [[ULMFiT]].
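The following is a minimal, illustrative sketch (not drawn from the cited sources) of the two usage paths named in the Context section: a [[Pretrained Large Language Model (LLM)]] serving as input to an [[In-Context Learning System]] (few-shot prompting with frozen weights) and to an [[LLM Fine-Tuning System]] (further gradient updates on task data). It assumes the Hugging Face <code>transformers</code> and <code>torch</code> libraries and uses the openly licensed [[GPT-2]] checkpoint <code>gpt2</code> as a stand-in for a larger pretrained LLM; the prompt and training text are hypothetical placeholders.
<pre>
# Minimal sketch: one pretrained LLM, two downstream uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # pretrained tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")   # pretrained LLM weights

# (1) In-context learning: the pretrained weights stay frozen;
#     the task is specified purely through a few-shot prompt.
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# (2) Fine-tuning: the same pretrained weights are the starting point
#     for further optimization on task-specific text (one step shown).
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch = tokenizer("Example task-specific training text.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
loss.backward()
optimizer.step()
</pre>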
----
----
== References ==
=== 2023 ===
* (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Large_language_model#List_of_large_language_models Retrieved:2023-3-19.
{| class="wikitable sortable"
|+ List of large language models
|-
! Name !! Release date{{efn|This is the date that documentation describing the model's architecture was first released.}} !! Developer !! Number of parameters{{efn|In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.}} !! Corpus size !! License{{efn|This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.}} !! Notes
|-
| [[BERT (language model)|BERT]] || 2018 || [[Google]] || 340 million<ref name=bert-paper/> || 3.3 billion words<ref name=bert-paper/> || {{yes|Apache 2.0}}<ref name=bert-web>{{Cite web|url=https://github.com/google-research/bert|title=BERT|date=March 13, 2023|via=GitHub}}</ref>
| early and influential language model<ref name=Manning-2022/>
|-
| [[GPT-2]] || 2019 || [[OpenAI]] || 1.5 billion<ref name="15Brelease"/> || 40GB<ref>{{cite web |title=Better language models and their implications |url=https://openai.com/research/better-language-models |website=openai.com}}</ref> (~10 billion tokens)<ref name="LambdaLabs">{{cite web |title=OpenAI's GPT-3 Language Model: A Technical Overview |url=https://lambdalabs.com/blog/demystifying-gpt-3 |website=lambdalabs.com |language=en}}</ref> || {{yes|MIT}}<ref>{{cite web|work=GitHub|title=gpt-2|url=https://github.com/openai/gpt-2|access-date=13 March 2023}}</ref>
| general-purpose model based on transformer architecture
|-
| [[GPT-3]] || 2020 || OpenAI || 175 billion || 499 billion tokens<ref name="LambdaLabs"/> || {{public web API}}
| A fine-tuned variant of [[GPT-3]], termed GPT-3.5, was made available to the public through a web interface called [[ChatGPT]] in 2022.<ref name=chatgpt-blog/>
|-
| [[GPT-Neo]] || March 2021 || [[EleutherAI]] || 2.7 billion<ref name="gpt-neo">{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023|via=GitHub}}</ref> || 825 GiB<ref name="Pile">{{cite arxiv |last1=Gao |first1=Leo |last2=Biderman |first2=Stella |last3=Black |first3=Sid |last4=Golding |first4=Laurence |last5=Hoppe |first5=Travis |last6=Foster |first6=Charles |last7=Phang |first7=Jason |last8=He |first8=Horace |last9=Thite |first9=Anish |last10=Nabeshima |first10=Noa |last11=Presser |first11=Shawn |last12=Leahy |first12=Connor |title=The Pile: An 800GB Dataset of Diverse Text for Language Modeling |arxiv=2101.00027|date=31 December 2020 }}</ref> || {{yes|MIT}}<ref name=vb-gpt-neo/>
| The first of [[EleutherAI#GPT-3 Replications|a series of free GPT-3 alternatives]] released by EleutherAI. GPT-Neo outperformed an equivalent-size [[GPT-3 model]] on some benchmarks, but was significantly worse than the largest GPT-3.<ref name=vb-gpt-neo/>
|-
| [[GPT-J]] || June 2021 || [[EleutherAI]] || 6 billion<ref>{{Cite web |title=GPT-J-6B: An Introduction to the Largest Open Source GPT Model {{!}} Forefront |url=https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |access-date=2023-02-28 |website=www.forefront.ai |language=en}}</ref> || 825 GiB<ref name="Pile"/> || {{yes|Apache 2.0}}
| GPT-3-style language model
|-
| Ernie 3.0 Titan || December 2021 || [[Baidu]] || 260 billion<ref>{{Cite web|url=https://www.wired.co.uk/article/chinas-chatgpt-black-market-baidu|title=China's ChatGPT Black Market Is Thriving|first=Condé|last=Nast|via=www.wired.co.uk}}</ref><ref>{{Cite journal|url=http://arxiv.org/abs/2112.12731|title=ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation|first1=Shuohuan|last1=Wang|first2=Yu|last2=Sun|first3=Yang|last3=Xiang|first4=Zhihua|last4=Wu|first5=Siyu|last5=Ding|first6=Weibao|last6=Gong|first7=Shikun|last7=Feng|first8=Junyuan|last8=Shang|first9=Yanbin|last9=Zhao|first10=Chao|last10=Pang|first11=Jiaxiang|last11=Liu|first12=Xuyi|last12=Chen|first13=Yuxiang|last13=Lu|first14=Weixin|last14=Liu|first15=Xi|last15=Wang|first16=Yangfan|last16=Bai|first17=Qiuliang|last17=Chen|first18=Li|last18=Zhao|first19=Shiyong|last19=Li|first20=Peng|last20=Sun|first21=Dianhai|last21=Yu|first22=Yanjun|last22=Ma|first23=Hao|last23=Tian|first24=Hua|last24=Wu|first25=Tian|last25=Wu|first26=Wei|last26=Zeng|first27=Ge|last27=Li|first28=Wen|last28=Gao|first29=Haifeng|last29=Wang|date=December 23, 2021|via=arXiv.org|arxiv=2112.12731}}</ref> || 4 Tb || {{no|Proprietary}}
| Chinese-language LLM. [[Ernie Bot]] is based on this model.
|-
| [[Claude]]<ref>{{cite web |title=Product |url=https://www.anthropic.com/product |website=Anthropic |access-date=14 March 2023 |language=en}}</ref> || December 2021 || [[Anthropic]] || 52 billion<ref name="AnthroArch">{{cite arxiv |last1=Askell |first1=Amanda |last2=Bai |first2=Yuntao |last3=Chen |first3=Anna |last4=Drain |first4=Dawn |last5=Ganguli |first5=Deep |last6=Henighan |first6=Tom |last7=Jones |first7=Andy |last8=Joseph |first8=Nicholas |last9=Mann |first9=Ben |last10=DasSarma |first10=Nova |last11=Elhage |first11=Nelson |last12=Hatfield-Dodds |first12=Zac |last13=Hernandez |first13=Danny |last14=Kernion |first14=Jackson |last15=Ndousse |first15=Kamal |last16=Olsson |first16=Catherine |last17=Amodei |first17=Dario |last18=Brown |first18=Tom |last19=Clark |first19=Jack |last20=McCandlish |first20=Sam |last21=Olah |first21=Chris |last22=Kaplan |first22=Jared |display-authors=3 |title=A General Language Assistant as a Laboratory for Alignment |arxiv=2112.00861 |date=9 December 2021 }}</ref> || 400 billion tokens<ref name="AnthroArch"/> || {{Closed beta}}
| fine-tuned for desirable behavior in conversations<ref>{{cite arxiv |last1=Bai |first1=Yuntao |last2=Kadavath |first2=Saurav |last3=Kundu |first3=Sandipan |last4=Askell |first4=Amanda |last5=Kernion |first5=Jackson |last6=Jones |first6=Andy |last7=Chen |first7=Anna |last8=Goldie |first8=Anna |last9=Mirhoseini |first9=Azalia |last10=McKinnon |first10=Cameron |last11=Chen |first11=Carol |last12=Olsson |first12=Catherine |last13=Olah |first13=Christopher |last14=Hernandez |first14=Danny |last15=Drain |first15=Dawn |last16=Ganguli |first16=Deep |last17=Li |first17=Dustin |last18=Tran-Johnson |first18=Eli |last19=Perez |first19=Ethan |last20=Kerr |first20=Jamie |last21=Mueller |first21=Jared |last22=Ladish |first22=Jeffrey |last23=Landau |first23=Joshua |last24=Ndousse |first24=Kamal |last25=Lukosuite |first25=Kamile |last26=Lovitt |first26=Liane |last27=Sellitto |first27=Michael |last28=Elhage |first28=Nelson |last29=Schiefer |first29=Nicholas |last30=Mercado |first30=Noemi |last31=DasSarma |first31=Nova |last32=Lasenby |first32=Robert |last33=Larson |first33=Robin |last34=Ringer |first34=Sam |last35=Johnston |first35=Scott |last36=Kravec |first36=Shauna |last37=Showk |first37=Sheer El |last38=Fort |first38=Stanislav |last39=Lanham |first39=Tamera |last40=Telleen-Lawton |first40=Timothy |last41=Conerly |first41=Tom |last42=Henighan |first42=Tom |last43=Hume |first43=Tristan |last44=Bowman |first44=Samuel R. |last45=Hatfield-Dodds |first45=Zac |last46=Mann |first46=Ben |last47=Amodei |first47=Dario |last48=Joseph |first48=Nicholas |last49=McCandlish |first49=Sam |last50=Brown |first50=Tom |last51=Kaplan |first51=Jared |display-authors=3 |title=Constitutional AI: Harmlessness from AI Feedback |arxiv=2212.08073 |date=15 December 2022 }}</ref>
|-
| [[GLaM]] (Generalist Language Model) || December 2021 || Google || 1.2 trillion<ref name=glam-blog/> || 1.6 trillion tokens<ref name=glam-blog/> || {{no|Proprietary}}
| sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3
|-
| [[LaMDA]] (Language Models for Dialog Applications) || January 2022 || Google || 137 billion<ref name=lamda-blog/> || 1.56T words<ref name=lamda-blog/> || {{no|Proprietary}}
| specialized for response generation in conversations
|-
| [[Megatron-Turing NLG]] || October 2021<ref>{{cite web |last1=Alvi |first1=Ali |last2=Kharya |first2=Paresh |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model |url=https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=Microsoft Research |date=11 October 2021}}</ref> || [[Microsoft]] and [[Nvidia]] || 530 billion<ref name=mtnlg-preprint/> || 338.6 billion tokens<ref name=mtnlg-preprint/> || {{no|Restricted web access}}
| standard architecture but trained on a supercomputing cluster
|-
| [[GPT-NeoX]] || February 2022 || [[EleutherAI]] || 20 billion<ref name="gpt-neox-20b">{{cite conference |title=GPT-NeoX-20B: An Open-Source Autoregressive Language Model |conference=Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models |date=2022-05-01 |last=Black |first=Sidney |last2=Biderman |first2=Stella |last3=Hallahan |first3=Eric |display-authors=etal |volume=Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models |pages=95-136 |url=https://aclanthology.org/2022.bigscience-1.9/ |accessdate=2022-12-19 }}</ref> || 825 GiB<ref name="Pile"/> || {{yes|Apache 2.0}}
| based on the Megatron architecture
|-
| [[Chinchilla AI|Chinchilla]] || March 2022 || [[DeepMind]] || 70 billion<ref name=chinchilla-blog/> || 1.3 trillion tokens<ref name=chinchilla-blog/><ref>{{cite arxiv |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |arxiv=2203.15556 |date=29 March 2022}}</ref> || {{no|Proprietary}}
| reduced-parameter model trained on more data
|-
| [[PaLM]] (Pathways Language Model) || April 2022 || Google || 540 billion<ref name=palm-blog/> || 768 billion tokens<ref name=chinchilla-blog/> || {{no|Proprietary}}
| aimed to reach the practical limits of model scale
|-
| [[OPT (Open Pretrained Transformer)]] || May 2022 || [[Meta Platforms|Meta]] || 175 billion<ref>{{cite web |title=Democratizing access to large-scale language models with OPT-175B |url=https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.facebook.com |language=en}}</ref> || 180 billion tokens<ref>{{cite arxiv |last1=Zhang |first1=Susan |last2=Roller |first2=Stephen |last3=Goyal |first3=Naman |last4=Artetxe |first4=Mikel |last5=Chen |first5=Moya |last6=Chen |first6=Shuohui |last7=Dewan |first7=Christopher |last8=Diab |first8=Mona |last9=Li |first9=Xian |last10=Lin |first10=Xi Victoria |last11=Mihaylov |first11=Todor |last12=Ott |first12=Myle |last13=Shleifer |first13=Sam |last14=Shuster |first14=Kurt |last15=Simig |first15=Daniel |last16=Koura |first16=Punit Singh |last17=Sridhar |first17=Anjali |last18=Wang |first18=Tianlu |last19=Zettlemoyer |first19=Luke |title=OPT: Open Pre-trained Transformer Language Models |arxiv=2205.01068 |date=21 June 2022}}</ref> || {{Non-commercial research}}{{efn|The smaller models including 66B are publicly available, while the 175B model is available on request.}}
| GPT-3 architecture with some adaptations from Megatron
|-
| YaLM 100B || June 2022 || [[Yandex]] || 100 billion<ref name=":0">{{Citation |last=Khrushchev |first=Mikhail |title=YaLM 100B |date=2022-06-22 |url=https://github.com/yandex/YaLM-100B |access-date=2023-03-18 |last2=Vasilev |first2=Ruslan |last3=Petrov |first3=Alexey |last4=Zinov |first4=Nikolay}}</ref> || 1.7TB<ref name=":0" /> || {{yes|Apache 2.0}}
| English-Russian model
|-
| [[BLOOM (language model)|BLOOM]] || July 2022 || Large collaboration led by [[Hugging Face]] || 175 billion<ref name=bigger-better/> || 350 billion tokens (1.6TB)<ref>{{cite web |title=bigscience/bloom · Hugging Face |url=https://huggingface.co/bigscience/bloom |website=huggingface.co}}</ref> || {{yes|Responsible AI}}
| Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
|-
| [[AlexaTM (Teacher Models)]] || November 2022 || [[Amazon (company)|Amazon]] || 20 billion<ref>{{cite web |title=20B-parameter Alexa model sets new marks in few-shot learning |url=https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |website=Amazon Science |language=en |date=2 August 2022}}</ref> || 1.3 trillion<ref>{{cite arxiv |last1=Soltan |first1=Saleh |last2=Ananthakrishnan |first2=Shankar |last3=FitzGerald |first3=Jack |last4=Gupta |first4=Rahul |last5=Hamza |first5=Wael |last6=Khan |first6=Haidar |last7=Peris |first7=Charith |last8=Rawls |first8=Stephen |last9=Rosenbaum |first9=Andy |last10=Rumshisky |first10=Anna |last11=Prakash |first11=Chandana Satya |last12=Sridhar |first12=Mukund |last13=Triefenbach |first13=Fabian |last14=Verma |first14=Apurv |last15=Tur |first15=Gokhan |last16=Natarajan |first16=Prem |display-authors=3|title=AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |arxiv=2208.01448 |date=3 August 2022}}</ref> || {{public web API}}<ref>{{cite web |title=AlexaTM 20B is now available in Amazon SageMaker JumpStart {{!}} AWS Machine Learning Blog |url=https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |website=aws.amazon.com |access-date=13 March 2023 |date=17 November 2022}}</ref>
| bidirectional sequence-to-sequence architecture
|-
| [[LLaMA]] (Large Language Model Meta AI) || February 2023 || [[Meta Platforms|Meta]] || 65 billion<ref name=llama-blog/> || 1.4 trillion<ref name=llama-blog/> || {{Non-commercial research}}{{efn|Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.}}
| trained on a large 20-language corpus to aim for better performance with fewer parameters.<ref name=llama-blog/>
|-
| [[GPT-4]] || March 2023 || OpenAI || Unknown{{efn|As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like [[GPT-4]], this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."<ref name="GPT4Tech">{{Cite web |date=2023 |title=GPT-4 Technical Report |url=https://cdn.openai.com/papers/gpt-4.pdf |website=[[OpenAI]] |access-date=March 14, 2023 |archive-date=March 14, 2023 |archive-url=https://web.archive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf |url-status=live }}</ref>}} || Unknown || {{public web API}}
| Available for ChatGPT Plus users. Microsoft confirmed that [[GPT-4 model]] is used in [[Bing Chat]].<ref>{{Cite web |date=March 14, 2023 |url=https://techcrunch.com/2023/03/14/microsofts-new-bing-was-using-gpt-4-all-along/ |title=Microsoft’s new Bing was using [[GPT-4]] all along |last=Lardinois |first=Frederic |website=TechCrunch |access-date=March 14, 2023 |archive-date=March 15, 2023 |archive-url=https://web.archive.org/web/20230315013650/https://techcrunch.com/2023/03/14/microsofts-new-bing-was-using-gpt-4-all-along/ |url-status=live }}</ref>
|}
=== 2023 ===
* ([[Zhao, Zhou et al., 2023]]) ⇒ [[Wayne Xin Zhao]], [[Kun Zhou]], [[Junyi Li]], [[Tianyi Tang]], [[Xiaolei Wang]], [[Yupeng Hou]], [[Yingqian Min]], [[Beichen Zhang]], [[Junjie Zhang]], [[Zican Dong]], Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and [[Ji-Rong Wen]]. ([[2023]]). “[https://arxiv.org/pdf/2303.18223.pdf A Survey of Large Language Models].” In: arXiv preprint arXiv:2303.18223. [http://dx.doi.org/10.48550/arXiv.2303.18223 doi:10.48550/arXiv.2303.18223]
=== 2022 ===
* ([[Li, Tang et al., 2022]]) ⇒ [[Junyi Li]], [[Tianyi Tang]], [[Wayne Xin Zhao]], [[Jian-Yun Nie]], and [[Ji-Rong Wen]]. ([[2022]]). “A Survey of Pretrained Language Models for Text Generation.” In: arXiv preprint arXiv:2201.05273. [https://doi.org/10.48550/arXiv.2201.05273 doi:10.48550/arXiv.2201.05273]
** ABSTRACT: Text Generation aims to produce plausible and readable text in a human language from input data. The resurgence of deep learning has greatly advanced this field, in particular, with the help of neural generation models based on [[pre-trained language models (PLMs)]]. Text generation based on [[PLM]]s is viewed as a promising approach in both academia and industry. In this paper, we provide a survey on the utilization of [[PLM]]s in text generation. We begin with introducing three key aspects of applying [[PLM]]s to text generation: 1) how to encode the input into representations preserving input semantics which can be fused into PLMs; 2) how to design an effective PLM to serve as the generation model; and 3) how to effectively optimize [[PLM]]s given the reference text and to ensure that the generated texts satisfy special text properties. Then, we show the major challenges arisen in these aspects, as well as possible solutions for them. We also include a summary of various useful resources and typical text generation applications based on PLMs. Finally, we highlight the future research directions which will further improve these [[PLM]]s for text generation. This comprehensive survey is intended to help researchers interested in text generation problems to learn the core concepts, the main techniques and the latest developments in this area based on PLMs.
----
__NOTOC__
[[Category:Concept]]
[[Category:Quality Silver]]