2024 GenieGenerativeInteractiveEnvir

From GM-RKB

Subject Headings:

Notes

  • It introduces Genie, a groundbreaking generative AI framework that creates interactive, controllable video environments from text, images, or sketches, without requiring action or text annotations during training.
  • It employs a comprehensive model architecture consisting of three primary components: a video tokenizer, a latent action model, and a dynamics model, each utilizing memory-efficient spatiotemporal transformers for effective temporal dynamics capture.
  • It showcases the ability to generate high-quality, controllable videos across various domains from different image prompts, including text-to-image outputs, sketches, and photos, and to model complex physical phenomena and object interactions accurately.
  • It reports notable quantitative results: an 11-billion-parameter model attains an FVD score of 40.1 on a filtered 30k-hour gaming video dataset, demonstrating Genie's proficiency in creating realistic and dynamic virtual environments.
  • It highlights the latent action space as a key feature that enables imitation learning in unseen environments, pushing the frontiers of generative video modeling and world simulation without requiring costly action annotations.
  • It discusses the societal impact of Genie, emphasizing its potential to augment human creativity and find applications in gaming and simulation industries, while also stressing the importance of responsible and ethical usage.
  • It opts not to release model weights or training data at this time, advocating for further research into the safe and ethical deployment of generative interactive environments.
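The three-component architecture described above (video tokenizer, latent action model, dynamics model) can be illustrated with a minimal sketch. This is an illustrative assumption, not the paper's implementation: the class names, shapes, and stand-in computations (a hash in place of a learned VQ encoder, a token-difference heuristic in place of a learned action encoder, and a trivial update in place of an autoregressive transformer) are all hypothetical, and serve only to show how the pieces fit together in a frame-by-frame interactive rollout.

```python
import numpy as np

# Hypothetical, simplified sketch of Genie's three components.
# Names, shapes, and internals are illustrative assumptions.

class VideoTokenizer:
    """Maps each video frame to a grid of discrete tokens
    (in the paper, a learned spatiotemporal VQ tokenizer)."""
    def __init__(self, codebook_size=1024, tokens_per_frame=16):
        self.codebook_size = codebook_size
        self.tokens_per_frame = tokens_per_frame

    def encode(self, frames):
        # frames: (T, H, W, C) -> tokens: (T, tokens_per_frame)
        t = frames.shape[0]
        # Stand-in for a learned encoder: deterministic stat of pixels.
        base = frames.reshape(t, -1).sum(axis=1, keepdims=True).astype(int)
        return (base + np.arange(self.tokens_per_frame)) % self.codebook_size

class LatentActionModel:
    """Infers a discrete latent action for each frame transition,
    learned without ground-truth action labels."""
    def __init__(self, num_actions=8):
        self.num_actions = num_actions

    def infer(self, tokens):
        # tokens: (T, N) -> actions: (T-1,), one latent action per transition
        diffs = np.abs(np.diff(tokens, axis=0)).sum(axis=1)
        return diffs % self.num_actions

class DynamicsModel:
    """Predicts the next frame's tokens from past tokens and a latent
    action (in the paper, an autoregressive spatiotemporal transformer)."""
    def predict(self, tokens, action):
        # Stand-in for one autoregressive generation step.
        return (tokens[-1] + action + 1) % 1024

# Frame-by-frame interactive rollout: encode past frames, inspect the
# inferred latent actions, then let a user-chosen action drive the next frame.
rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8, 3))          # four small RGB frames

tok, lam, dyn = VideoTokenizer(), LatentActionModel(), DynamicsModel()
tokens = tok.encode(frames)                 # (4, 16) discrete tokens
actions = lam.infer(tokens)                 # (3,) inferred latent actions
next_tokens = dyn.predict(tokens, action=3) # (16,) tokens for the next frame
print(tokens.shape, actions.shape, next_tokens.shape)
```

The key design point this sketch mirrors is that the latent action space is discrete and small, so a user (or an imitation-learning agent) can steer generation one action per frame without any labeled action data.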

Cited By

Quotes

Abstract

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

References

Simon Osindero, Jeff Clune, Scott Reed, Nando de Freitas, Sherjil Ozair, Matthew Lai, Tim Rocktäschel, Nicolas Heess, Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Lucy Gonzalez, Jingwei Zhang, Konrad Zolna, Satinder Singh. (2024). "Genie: Generative Interactive Environments."