Vision-Language-Action Model
A Vision-Language-Action Model is a multimodal AI model that integrates visual perception, language understanding, and physical action generation to perform embodied AI tasks such as robotic manipulation.
- AKA: VLA Model, Vision-Language-Action (VLA) Model, Multimodal Action Model, Embodied Language Model.
- Context:
- It can typically process Visual Inputs including camera feeds, depth maps, and object detections.
- It can typically understand Natural Language Instructions for task specification and goal definition.
- It can typically generate Robot Actions through motor commands and trajectory planning (see the interface sketch after this list).
- It can often enable Real-Time Control through low-latency inference and continuous adaptation.
- It can often support Multi-Task Learning across manipulation tasks, navigation tasks, and interaction tasks.
- It can often integrate with Reinforcement Learning for policy optimization and reward modeling.
- It can range from being a Simple Pick-and-Place Model to being a Complex Dexterous Manipulation Model, depending on its action complexity.
- It can range from being a Single-Modal Input Model to being a Full-Multimodal Model, depending on its sensory integration.
- It can range from being a Reactive Control Model to being a Predictive Planning Model, depending on its temporal horizon.
- It can range from being a Task-Specific Model to being a General-Purpose Model, depending on its capability scope.
- ...
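The context bullets above reduce to a simple input/output contract: at each control step the model consumes a camera frame and a natural-language instruction and emits a low-level robot action. The Python sketch below illustrates only that contract; the Observation, Action, and VisionLanguageActionModel names are hypothetical placeholders rather than any specific system's API, and the action parameterization (a 6-DoF end-effector delta plus a gripper state) is one common but assumed choice.

```python
from dataclasses import dataclass
from typing import Sequence
import numpy as np


@dataclass
class Observation:
    """One control-loop observation: an RGB camera frame plus a task instruction."""
    rgb_image: np.ndarray   # (H, W, 3) uint8 camera frame
    instruction: str        # natural-language task specification


@dataclass
class Action:
    """A low-level robot command: 6-DoF end-effector delta plus gripper state (assumed parameterization)."""
    delta_pose: np.ndarray  # (6,) translation + rotation deltas
    gripper_open: bool


class VisionLanguageActionModel:
    """Hypothetical VLA policy interface: maps (image, instruction) -> action.

    This stub only illustrates the input/output contract; a trained model
    would run multimodal inference inside predict_action.
    """

    def predict_action(self, obs: Observation) -> Action:
        # Placeholder output; a real policy would condition on obs.rgb_image
        # and obs.instruction to produce a task-relevant command.
        return Action(delta_pose=np.zeros(6), gripper_open=True)


def control_loop(model: VisionLanguageActionModel,
                 observations: Sequence[Observation]) -> list[Action]:
    """Reactive closed-loop control: query the policy once per camera frame."""
    return [model.predict_action(obs) for obs in observations]
```

Published VLA systems such as RT-2 implement this mapping by tokenizing the image and instruction, running a transformer backbone, and decoding discretized action tokens; the sketch above abstracts away those architectural details.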
- Examples:
- Google Vision-Language-Action Models, such as: RT-2 Model, PaLM-E Model.
- Academic Vision-Language-Action Models, such as: OpenVLA Model, Octo Model.
- Commercial Vision-Language-Action Models, such as: Physical Intelligence π0 Model, Figure Helix Model.
- ...
- Counter-Examples:
- Vision-Only Model, which lacks action generation.
- Language-Only Model, which lacks visual perception.
- Pure Reinforcement Learning Model, which lacks language grounding.
- See: Multimodal Model, Robotics Model, Embodied AI Model, AI World Model, Genie World Model, Intuitive Physics Understanding Task, Hybrid AI Model, Transformer Model, Computer Vision Model.