Vision-Language-Action Model

From GM-RKB

A Vision-Language-Action Model is a multimodal AI model that integrates visual perception, language understanding, and physical action for embodied AI tasks.
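The interface such a model exposes can be sketched in a few lines. The encoders and weights below are toy stand-ins (a real VLA model learns the whole mapping end to end with large pretrained vision and language backbones); only the overall shape, observation plus instruction in, continuous action out, reflects the definition above.

```python
def encode_image(pixels):
    """Toy visual encoder: average brightness as a 1-D feature."""
    flat = [v for row in pixels for v in row]
    return [sum(flat) / len(flat)]

def encode_text(instruction):
    """Toy language encoder: fixed-size bag-of-characters feature."""
    feat = [0.0] * 4
    for ch in instruction.lower():
        feat[ord(ch) % 4] += 1.0
    return feat

def vla_policy(pixels, instruction):
    """Map (visual observation, language instruction) -> action vector.

    A real VLA model learns this fusion end to end; here it is just
    concatenation followed by a fixed elementwise linear readout.
    """
    features = encode_image(pixels) + encode_text(instruction)
    weights = [0.1, 0.2, -0.1, 0.05, 0.3]  # stand-in for learned parameters
    return [w * f for w, f in zip(weights, features)]  # e.g. end-effector deltas

image = [[0.0, 0.5], [1.0, 0.5]]  # 2x2 grayscale "observation"
action = vla_policy(image, "pick up the red block")
print(len(action))  # dimensionality of the action output
```

The key point the sketch illustrates is that all three modalities meet in one policy: the same forward pass consumes pixels and text and emits a physical command, rather than chaining separate perception, language, and control systems.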