Vision-Language Pre-training Task
A Vision-Language Pre-training Task is a pre-training task that aligns visual representations with textual representations, typically through contrastive learning objectives, to support cross-modal understanding tasks.
- AKA: VL Pre-training Task, Image-Language Pre-training Task, Visual-Linguistic Pre-training Task, Cross-Modal Pre-training Task, Multimodal Pre-training Task.
- Context:
- It can typically support Zero-Shot Image Classification Tasks through text-based classifiers.
- It can typically enable Image-Text Retrieval Tasks with semantic alignment mechanisms.
- It can often utilize Contrastive Learning Losses for representation alignment (see the loss sketch after this list).
- It can often employ Large-Scale Image-Text Datasets like LAION Datasets.
- It can integrate Transformer Architectures for multi-modal encoding.
- It can support Differentiable Prompt Learning Techniques through learned embedding spaces.
- It can enable Cross-Modal Transfer Learning Tasks with shared representations.
- It can range from being a Small-Scale Vision-Language Pre-training Task to being a Web-Scale Vision-Language Pre-training Task, depending on its dataset size.
- It can range from being a Single-Objective Vision-Language Pre-training Task to being a Multi-Objective Vision-Language Pre-training Task, depending on its training objective diversity.
- It can range from being a Symmetric Vision-Language Pre-training Task to being an Asymmetric Vision-Language Pre-training Task, depending on its encoder architecture.
- It can range from being a Frozen-Encoder Vision-Language Pre-training Task to being an End-to-End Vision-Language Pre-training Task, depending on its training strategy.
- ...
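The contrastive alignment objective referenced above can be illustrated with a minimal PyTorch sketch of a symmetric InfoNCE-style loss over a batch of matched image-text pairs. This is an illustrative sketch, not any particular model's implementation; the function name, the fixed temperature value, and the assumption that embeddings are already produced by separate image and text encoders are all simplifications.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of N matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image is paired with the i-th text, so targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image), then average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: random embeddings standing in for encoder outputs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_style_contrastive_loss(image_emb, text_emb))
```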
- Example(s):
- CLIP Pre-training, using contrastive image-text matching (see the zero-shot usage sketch after this list).
- ALIGN Pre-training, trained on noisy web-scale alt-text data.
- FILIP Pre-training, using fine-grained token matching.
- Florence Pre-training, Microsoft's vision foundation model.
- CoCa Pre-training, combining contrastive and captioning objectives.
- BLIP Pre-training, with bootstrapped caption generation and filtering.
- ...
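As a downstream illustration of how such pre-training supports zero-shot image classification with text-based classifiers, the following minimal sketch queries the publicly released openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers CLIP API. The image path and candidate label prompts are placeholder assumptions.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes are expressed as natural-language prompts (the "text-based classifier").
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the class set is defined only by the text prompts, new categories can be added at inference time without retraining, which is what distinguishes this from a supervised classifier with fixed categories.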
- Counter-Example(s):
- Vision-Only Pre-training Task, such as ImageNet pre-training.
- Text-Only Pre-training Task, such as masked language modeling.
- Supervised Image Classification Task, with fixed categories.
- Caption Generation Task, focusing on text generation only.
- See: Pre-training Task, Vision-Language Model, CLIP Model, Contrastive Learning, Multi-Modal Learning, Zero-Shot Learning, Cross-Modal Retrieval, Transformer Architecture, Image-Text Dataset, Differentiable Prompt Learning Technique, Multimodal Contrastive Learning Task.