Multi-Modal Large Language Model (MLLM)

A Multi-Modal Large Language Model (MLLM) is a large language model that can consume and/or produce content in modalities beyond text, such as images, audio, or video — for example, a text-to-* model (such as text-to-image) or a *-to-text model (such as image-and-text-to-text). A minimal usage sketch follows.
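As an illustration, the sketch below queries one open MLLM (BLIP-2) for image-conditioned text generation through the Hugging Face transformers library. The checkpoint id is a real public model; the image file name and the prompt are placeholders, and any comparable image-and-text-to-text model could be substituted.

```python
# Minimal sketch of an image-and-text-to-text MLLM query via the
# Hugging Face transformers library. "photo.jpg" is a placeholder
# for any local image; the checkpoint weights are several GB.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("photo.jpg")                           # visual input
prompt = "Question: what is shown in the image? Answer:"  # text input

inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)         # text output
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```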



References

2023

  • (Koh et al., 2023) ⇒ J. Y. Koh, R. Salakhutdinov, and D. Fried. (2023). “Grounding Language Models to Images for Multimodal Generation.” In: arXiv preprint arXiv:2301.13823.
    • QUOTE: “… language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model … This allows our model to process …”
    • NOTE: It proposes grounding a language model pretrained only on text to images while keeping the language model frozen, so that capabilities such as in-context learning and free-form text generation carry over to multimodal generation; a minimal sketch of this setup follows the entry.
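Based on the quoted passage (the pretrained language model is kept frozen and visual inputs are mapped into it), the following is a minimal sketch of such a grounding layer; the module name, dimensions, and visual-token count are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class VisualGrounding(nn.Module):
    """Maps features from a frozen visual encoder into a frozen LM's
    token-embedding space via a small trainable projection (the only
    part that would be updated during training)."""

    def __init__(self, vis_dim: int = 768, lm_dim: int = 4096, n_vis_tokens: int = 4):
        super().__init__()
        self.n_vis_tokens = n_vis_tokens
        # Trainable linear map: one pooled image feature -> n "soft" LM tokens.
        self.proj = nn.Linear(vis_dim, lm_dim * n_vis_tokens)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, vis_dim) pooled output of a frozen
        # visual encoder (e.g., a CLIP-style image encoder).
        b = image_feats.shape[0]
        return self.proj(image_feats).view(b, self.n_vis_tokens, -1)

# Prepend the projected visual tokens to the text token embeddings and
# feed the concatenation to the frozen LM as usual:
grounder = VisualGrounding()
image_feats = torch.randn(2, 768)       # stand-in for visual-encoder output
text_embeds = torch.randn(2, 10, 4096)  # stand-in for LM token embeddings
lm_inputs = torch.cat([grounder(image_feats), text_embeds], dim=1)
print(lm_inputs.shape)                  # torch.Size([2, 14, 4096])
```

In this setup only proj would receive gradient updates; the visual encoder and the language model stay frozen, which is what preserves the text-only pretraining behaviors the quote highlights.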

2014

  • (Kiros et al., 2014) ⇒ R. Kiros, R. Salakhutdinov, and R. Zemel. (2014). “Multimodal Neural Language Models.” In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014).
    • QUOTE: “… This work takes a first step towards generating image descriptions with a multimodal language model and sets a baseline when no additional structures are used. For future work …”
    • NOTE: It takes a first step toward generating image descriptions with a multimodal neural language model, setting a baseline that uses no additional structures; a toy sketch of the idea follows.
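For illustration, below is a toy multimodal language model that conditions next-word prediction on an image feature as well as the preceding words. It uses a simplified additive-fusion design rather than the paper's log-bilinear formulation, and every name and dimension is an assumption chosen for exposition.

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Next-word predictor conditioned on a bag of context-word
    embeddings plus a projected image feature (illustrative only)."""

    def __init__(self, vocab: int = 10_000, emb: int = 256, img_dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.img_proj = nn.Linear(img_dim, emb)  # image modality -> word space
        self.out = nn.Linear(emb, vocab)         # scores over the vocabulary

    def forward(self, context_ids: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, n_ctx) word ids; img: (batch, img_dim)
        ctx = self.word_emb(context_ids).mean(dim=1)  # pool the context words
        h = ctx + self.img_proj(img)                  # fuse modalities additively
        return self.out(h)                            # (batch, vocab) next-word logits

model = ToyMultimodalLM()
logits = model(torch.randint(0, 10_000, (2, 5)), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 10000])
```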