Topic 24: What is Cosmos World Foundation Model Platform?

Community Article Published January 23, 2025

World models are the next big thing that enables Physical AI. Let's explore how NVIDIA makes it happen




We recently discussed Physical AI and Jensen Huang's vision for achieving it through Agentic AI. At its core, Physical AI refers to systems capable of understanding and engaging with the physical world, leveraging advanced technologies like sensor-driven agents, robotics, and physical simulation platforms. While still emerging, the growing focus on agents and robotics signals meaningful progress toward this ambitious vision.

But this progress depends on the development of World Foundation Models (WFMs) – AI systems trained to simulate real-world environments and predict outcomes from text, image, or video inputs. These models are key to creating physics-aware videos, enabling AI to better understand and interact with the physical world. There is a lot to solve there!

Two weeks ago, NVIDIA unveiled not just a model but an entire ecosystem, which it calls Cosmos. This new platform, complete with three WFMs, was also open-sourced (more precisely, it is available under the NVIDIA Open Model License).

Even if you’re not building robots, understanding the technologies shaping Physical AI – like NVIDIA’s Cosmos – matters. Why? Because these innovations are rewriting how AI systems learn, interact, and solve real-world problems. From smarter automation to groundbreaking simulations, the ripple effects will touch every corner of AI. Let’s dive into its components and explore the transformative potential it holds for the AI landscape in general and Physical AI in particular.





What is Physical AI? A quick reminder

Let’s begin with the basic concept to clarify what the Cosmos WFM platform works with. Physical AI refers to AI systems equipped with sensors to perceive their environment and actuators to interact with and alter it. Embodied AI agents and robots are prime examples of this domain, designed to handle tasks that are dangerous, exhausting, or repetitive for humans.

Despite rapid advancements in many areas of AI, Physical AI has lagged behind. Mastering the complexities of physical reality remains an extraordinary challenge, requiring systems that can not only process vast sensory data but also make intelligent decisions in dynamic environments.

A crucial step toward achieving Physical AI is the development of Agentic AI – autonomous systems with the cognitive and decision-making capabilities needed to power embodied AI. These systems bridge the gap between perception and action, enabling more sophisticated interactions with the physical world.

One major obstacle is the difficulty of collecting training data for Physical AI. Real-world experimentation is often risky, expensive, and time-intensive, requiring detailed sequences of observations and actions. A promising solution to this challenge lies in World Foundation Models (WFMs).

World Foundation Models (WFMs)

A World Foundation Model (WFM) is a digital replica of the physical world where Physical AI can safely learn and practice. Some WFMs, like Google DeepMind’s Genie 2 and the AI system from World Labs, co-founded by Fei-Fei Li (we discussed them in one of our FODs), work with images or text: they can generate 3D environments from a single image or text prompt. They also let users interact with these environments and objects, and can even add physical effects.

The Cosmos World Foundation Model Platform, in turn, focuses on visual WFMs, which work with videos to simulate and train AI systems. But how does a WFM work in general?

A WFM can predict what happens next in the world. It uses past observations x(0:t) (in the case of Cosmos WFMs, a video) and a change or action c(t), called a perturbation, to predict the future observation x(t+1). For example, if you show the model a video of a ball rolling (past observation) and tell it someone will push the ball (perturbation), it predicts how the ball will move next.
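
The prediction loop described above can be sketched in a few lines. This is a toy illustration, not Cosmos code: `ToyBallWorldModel`, its constant-velocity physics, and the scalar "push" perturbation are all assumptions made for the example, and a real WFM predicts video frames rather than positions.

```python
from typing import List, Sequence


class ToyBallWorldModel:
    """Toy stand-in for a WFM: observations are 1D ball positions, not video frames."""

    def predict_next(self, past: Sequence[float], perturbation: float) -> float:
        # Estimate the current velocity from the last two past observations.
        velocity = past[-1] - past[-2] if len(past) >= 2 else 0.0
        # The perturbation c(t) is modeled as a push that changes the velocity.
        return past[-1] + velocity + perturbation


def rollout(model: ToyBallWorldModel, past: Sequence[float],
            perturbations: Sequence[float]) -> List[float]:
    """Predict x(t+1), x(t+2), ... by feeding each prediction back as an observation."""
    history = list(past)
    for c_t in perturbations:
        history.append(model.predict_next(history, c_t))
    return history[len(past):]


# A ball rolling at 1 unit per step is pushed once (c = 0.5), then left alone.
future = rollout(ToyBallWorldModel(), past=[0.0, 1.0, 2.0], perturbations=[0.5, 0.0, 0.0])
print(future)  # [3.5, 5.0, 6.5]
```

The important part is the shape of the interface: past observations plus a perturbation go in, the next observation comes out, and longer futures are produced by repeating that step.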

[Image] Image credit: The original paper

Here’s a summary of general benefits of WFMs:

  • They save time and resources for training.
  • WFMs provide a safe and efficient training ground for AI systems to learn how to act in various situations.
  • WFMs create realistic synthetic data that helps AI systems learn a wide range of actions.

Now, it’s time to return to the Cosmos WFM platform and explore its features and capabilities.

How does Cosmos WFM platform work?

NVIDIA provided a simple diagram showing what the Cosmos platform consists of. It includes tools and models to create, train, and use WFMs for Physical AI:

[Image] Image credit: The original paper

  1. Video Curator: Extracts high-quality, dynamic video clips from large datasets for training WFMs. It removes duplicates to ensure a diverse and compact training dataset.
  2. Tokenizers: Compress video data into "tokens" (small, manageable chunks) while keeping essential details. This makes training faster and more efficient.
  3. Pre-training WFMs: Two main methods are used:
    • Diffusion models: Gradually refine noisy video simulations into realistic ones.
    • Autoregressive models: Build video sequences step by step.
     Both kinds of models are trained on vast video datasets to learn general patterns of how the world works, and both rely on tokens to manage computational complexity.
  4. Post-training WFMs: Fine-tunes the pre-trained WFMs for specific tasks, such as simulating robot movements, navigating a virtual world, or autonomous driving.
  5. Guardrail: This system ensures that the models avoid harmful inputs and outputs, protecting developers and systems during use.

Let's discuss everything in order.

Video Curator

As mentioned earlier, creating high-quality training data is the main purpose of a WFM, and the WFM itself is only as good as the videos it learns from. This is why the video data curation step is crucial for the entire system.

NVIDIA developed a pipeline to extract high-quality, dynamic video clips (100 million clips from 20 million hours of video) for training the models.

[Image] Image credit: The original paper

Here’s how this pipeline works:

  1. Collecting: The pipeline starts by collecting raw videos from both proprietary collections and publicly available internet videos. They vary in quality and format and cover content such as driving, hand-object interactions, human activities, navigation, nature, and more.
  2. Splitting videos: Algorithms detect scene changes using visual features like color or motion. Clips under 2 seconds are removed, and those over 60 seconds are split to a maximum of 60 seconds. Videos are also re-encoded into a high-quality MP4 format.
  3. Filtering: This step is used to improve the quality of the dataset.
    • Motion filtering: Removes static or erratically moving clips and tags clips based on camera panning or zooming.
    • Quality filtering: Discards videos with poor quality (blurriness, overexposure) and keeps only visually appealing content.
    • Overlay text filtering: Eliminates videos with added text (subtitles or graphics) that could interfere with learning.
    • Video type filtering: Focuses the dataset on useful content, such as human actions, while avoiding less relevant categories like animations or abstract visuals.
  4. Adding annotations: Descriptions of video content help AI models understand and learn from the data. That’s why the VILA 13B VLM is used to generate captions for each clip.
  5. Removing duplicates: Videos are clustered by visual content, and duplicates are removed so that only the highest-quality version of each clip is kept.
  6. Sharding: The processed clips are grouped into "shards" or data bundles, sorted by resolution and length. This makes the dataset easy to use for training.

The final result is a curated, diverse, and clean dataset with pre-training data (general-purpose clips for broad model training) and fine-tuning data (high-quality clips for specialized tasks).
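
The length rules from the splitting step are easy to make concrete. The sketch below is an illustration only: the function name, the `(start, end)` representation of a shot, and the choice to drop leftover pieces shorter than 2 seconds are assumptions, not details taken from NVIDIA's pipeline.

```python
from typing import List, Tuple

MIN_CLIP_SECONDS = 2.0   # clips shorter than this are removed (from the article)
MAX_CLIP_SECONDS = 60.0  # longer clips are split to at most this length (from the article)

Clip = Tuple[float, float]  # (start, end) in seconds


def split_and_filter(shots: List[Clip]) -> List[Clip]:
    """Turn detected shots into clips between 2 and 60 seconds long."""
    clips: List[Clip] = []
    for start, end in shots:
        cursor = start
        while cursor < end:
            piece_end = min(cursor + MAX_CLIP_SECONDS, end)
            # Assumption: leftover pieces under 2 s are dropped like any short clip.
            if piece_end - cursor >= MIN_CLIP_SECONDS:
                clips.append((cursor, piece_end))
            cursor = piece_end
    return clips


# Hypothetical shot boundaries produced by a scene-change detector.
shots = [(0.0, 1.5), (1.5, 31.5), (31.5, 162.5)]
clips = split_and_filter(shots)
print(clips)
# [(1.5, 31.5), (31.5, 91.5), (91.5, 151.5), (151.5, 162.5)]
```

The 1.5-second shot disappears, the 30-second shot passes through unchanged, and the 131-second shot becomes three clips.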

Cosmos Tokenizer

NVIDIA specifically designed the Cosmos Tokenizer to handle both continuous and discrete tokenization for images and videos. It follows an encoder-decoder design.

[Image] Image credit: The original paper

What is special about Cosmos Tokenizer?

  • Compression: The data is compressed both spatially (reducing resolution) and temporally (reducing the number of frames).
  • Wavelet transform: Simplifies the video input by removing redundant pixel information, making the data easier to process for causal downsampling.
  • Causal downsampling: Processes frames in sequence, past to present, without relying on future frames, which is crucial for real-world AI applications.
  • Advanced layers: The architecture uses causal temporal convolution and attention layers to maintain the natural order of video frames during encoding and decoding.
  • Spatio-temporal convolution: Captures both spatial (image) and temporal (time) patterns in the data.
  • Self-attention mechanisms: Helps the model focus on important details across frames.
  • Residual blocks: Add shortcut connections between input and output tokens to improve gradient flow and training stability.
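
To make "causal" concrete, here is a minimal sketch of a causal temporal convolution. It is not the Cosmos Tokenizer's layer: each frame is reduced to a single number, the kernel weights are made up, and padding by repeating the first frame is an assumption chosen to keep the output the same length as the input.

```python
from typing import List, Sequence


def causal_temporal_conv(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D convolution over time where output t only sees frames 0..t.

    Each "frame" is a single number here to keep the sketch small.
    kernel[-1] weights the current frame, kernel[-2] the previous one, and so on.
    """
    k = len(kernel)
    # Left-pad by repeating the first frame so the output keeps the input length.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]
video = [1.0, 2.0, 3.0, 4.0]
out = causal_temporal_conv(video, kernel)

# Changing only the last frame must leave all earlier outputs untouched.
edited = causal_temporal_conv([1.0, 2.0, 3.0, 100.0], kernel)
print(out)                     # [1.0, 1.5, 2.25, 3.25]
print(out[:3] == edited[:3])   # True
```

Because padding is applied only on the past side, no output ever depends on a frame that has not been seen yet, which is the property the list above calls causal.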

These features and architecture lead to a high level of efficiency of Cosmos Tokenizer:

  • Even at higher compression ratios (8 × 8 × 8, 8 × 16 × 16), it maintains superior quality compared to other tokenizers. At 16 × 16, it often matches or surpasses competitors' image quality at 8 × 8.
  • It is 2x to 12x faster than other tokenizers.
  • It uses fewer parameters, making it lightweight and efficient.
  • Cosmos Tokenizer outperforms previous tokenizers in retaining details and smoothness in both images and videos.
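
A quick back-of-the-envelope calculation shows what these compression ratios mean, reading them as temporal × spatial × spatial. The clip size below is hypothetical, and the sketch assumes every dimension divides evenly, ignoring any special handling of the first frame that a causal tokenizer may apply.

```python
from typing import Tuple


def latent_grid(frames: int, height: int, width: int,
                temporal: int, spatial: int) -> Tuple[int, int, int]:
    """Size of the compressed grid for a temporal x spatial x spatial ratio."""
    # Assumption: every dimension divides evenly; first-frame handling is ignored.
    return (frames // temporal, height // spatial, width // spatial)


def reduction_factor(temporal: int, spatial: int) -> int:
    """How many input positions collapse into one latent position."""
    return temporal * spatial * spatial


# Hypothetical 32-frame clip at 512 x 512 pixels.
for t, s in [(8, 8), (8, 16)]:
    print(f"{t}x{s}x{s}:", latent_grid(32, 512, 512, t, s), reduction_factor(t, s))
# 8x8x8: (4, 64, 64) 512
# 8x16x16: (4, 32, 32) 2048
```

Moving from 8 × 8 × 8 to 8 × 16 × 16 leaves four times fewer latent positions for the downstream model to process, which is why holding quality at the higher ratio matters.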

And now we've reached the core part of the system: the WFMs.

Pre-trained WFMs

The Cosmos platform’s pre-trained WFMs are powerful tools for generating and predicting videos. They come in two families, diffusion and autoregressive, so that developers can leverage the strengths of each.

Diffusion WFMs

Diffusion WFMs excel at generating high-quality, realistic outputs with smooth transitions. Here is how Cosmos Diffusion WFM’s components work together step by step:

[Image] Image credit: The original paper

  1. The model begins by compressing the input video using the Cosmos Tokenizer, which transforms the video into a compact latent representation.

  2. Gaussian noise is added to the latent representation, simulating imperfections for the model to "denoise" and refine.

  3. 3D Patchification: The noisy latent representation is divided into smaller 3D patches (cubes of data) to simplify processing.

  4. The model processes these patches through multiple layers which include:

    • Self-attention: Helps the model focus on important details within the video.
    • Cross-attention: Integrates information from the input text to guide video generation.
    • Feed-forward MLP layers: Refines the features at each step.
    • Adaptive layer normalization: Ensures stable and efficient learning by adjusting the scale, shift, and gating of the data.
  5. After processing, the refined latent representation is passed through the decoder of the tokenizer, which reconstructs the final, high-quality video.
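
Steps 2 and 3 of the list above can be illustrated on a tiny latent. This is a sketch under stated assumptions: the latent is a plain nested list indexed as `[t][h][w]` with channels omitted, and the noise level and patch size are arbitrary values, not the ones Cosmos uses.

```python
import random
from typing import List

Latent = List[List[List[float]]]  # indexed as latent[t][h][w]; channels omitted for brevity


def add_gaussian_noise(latent: Latent, sigma: float, rng: random.Random) -> Latent:
    """Step 2: perturb every latent value with zero-mean Gaussian noise."""
    return [[[v + rng.gauss(0.0, sigma) for v in row] for row in frame] for frame in latent]


def patchify_3d(latent: Latent, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Step 3: cut the latent into non-overlapping pt x ph x pw cubes, each flattened."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "sizes must divide evenly"
    patches = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                patches.append([
                    latent[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return patches


# Hypothetical 2 x 4 x 4 latent whose values encode their own flat index.
latent = [[[float(t * 16 + h * 4 + w) for w in range(4)] for h in range(4)] for t in range(2)]
noisy = add_gaussian_noise(latent, sigma=0.1, rng=random.Random(0))
patches = patchify_3d(noisy, pt=1, ph=2, pw=2)
print(len(patches), len(patches[0]))  # 8 4
```

Each flattened patch becomes one element of the sequence that the attention and MLP layers in step 4 operate on.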

Diffusion WFM also employs 3D Rotary Positional Embeddings (RoPE), which allows the model to handle varying video lengths, resolutions, and aspect ratios seamlessly.
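
The key property of rotary embeddings is that attention scores depend on the relative offset between two positions, not on their absolute values, which is what makes them tolerant of different sequence lengths. The sketch below shows the standard one-dimensional formulation; how Cosmos extends it to three axes is not reproduced here, and the vectors are made-up values.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -1.0, 0.2]

# The same offset (3) at different absolute positions gives the same attention score.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 45), rope_rotate(k, 42))
print(abs(score_a - score_b) < 1e-9)  # True
```

A 3D variant would apply the same kind of rotation along the time, height, and width axes, typically on separate groups of feature channels.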

A prompt upsampler transforms simple prompts into detailed descriptions, so that generated videos align with user intent and contain richer visual detail.

Cosmos Diffusion WFM has two configurations:

  • Text2World Models: Generate videos based on text prompts using cross-attention layers.
  • Video2World Models: Extend an existing video or predict future frames. They combine an initial video and descriptive prompt for richer predictions.

Results of Cosmos Diffusion WFM:

It is introduced in two sizes: the Cosmos-1.0-Diffusion-7B and 14B models. They both generate high-quality, realistic videos, but the 14B model excels in capturing complex scenes and maintaining motion stability compared to the 7B model.

The diffusion-based WFMs produce photorealistic videos with smooth motion dynamics and accurate alignment with text prompts.

[Image] Image credit: The original paper

Autoregressive WFMs

Autoregressive models are more efficient for step-by-step predictions, making them useful for sequential tasks. The Cosmos-1.0-Autoregressive-Video2World model generates future video frames by combining input videos and text prompts.

Here's how it works:

[Image] Image credit: The original paper

  1. The input video is encoded using the encoder of Cosmos Tokenizer, converting it into discrete tokens. Text prompts are processed using a T5 text encoder, transforming them into embeddings that guide the video generation process.
  2. The video tokens are transformed into learned embeddings for further processing. Each transformer block uses:
    • 3D Positional Embeddings: Includes both absolute and rotary (RoPE) embeddings to capture spatial and temporal relationships in the video.
    • Self-attention: Focuses on important patterns in the video tokens.
    • Cross-attention: Enables text-guided video generation by integrating text embeddings with video tokens to align the output with the input prompt.
    • MLP (Two-Layer Feed-Forward Network): Refines the processed information.
  3. Output reconstruction: The processed embeddings are converted back into video frames by the Cosmos Tokenizer decoder, reconstructing the video based on the output tokens.
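
The "step by step" nature of autoregressive generation boils down to a short loop: score every possible next token given everything so far, pick one, append it, and repeat. The sketch below uses greedy selection and a hypothetical `toy_model` in place of the transformer; the token IDs are made up, and real systems usually sample rather than always taking the top score.

```python
from typing import Callable, List, Sequence

# Maps the tokens seen so far to one score per vocabulary entry.
NextTokenFn = Callable[[Sequence[int]], List[float]]


def generate_tokens(next_token_scores: NextTokenFn,
                    prompt_tokens: Sequence[int], num_new: int) -> List[int]:
    """Greedy autoregressive decoding: append the best-scoring token, then repeat."""
    tokens = list(prompt_tokens)
    for _ in range(num_new):
        scores = next_token_scores(tokens)
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens[len(prompt_tokens):]


def toy_model(context: Sequence[int], vocab_size: int = 8) -> List[float]:
    """Hypothetical stand-in for the transformer: favors (last token + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


# Tokens of the input video (made-up IDs) are continued with five predicted tokens.
new_tokens = generate_tokens(toy_model, prompt_tokens=[5, 6], num_new=5)
print(new_tokens)  # [7, 0, 1, 2, 3]
```

In the real model the new tokens would then go through the tokenizer's decoder to become video frames, as in step 3.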

To produce videos with sharp details and minimal artifacts, a diffusion decoder can be added to the architecture. It transforms discrete tokens into high-quality continuous representations, addressing the blurry outputs caused by aggressive tokenization compression, especially in larger autoregressive WFMs.

Model variants include:

  • Base Models with 4B and 12B parameters.
  • Video2World Models, like Cosmos-1.0-Autoregressive-13B-Video2World, that are text-guided variants. They are derived from base models, including cross-attention layers for text conditioning.

These models demonstrate the following results:

  • Larger models like 13B-Video2World generate videos with better motion consistency and richer details.
  • Smaller models like 4B are faster but may struggle with complex tasks.

Some limitations were also found:

  • Text inputs in text-conditioned models (Video2World) may not always strongly influence generation due to the model's training focus on video prediction tasks.
  • Occasionally, objects may appear unexpectedly, like "popping in" from below.

[Image] Image credit: The original paper

How good are Cosmos WFMs?

The main purpose of any WFM is to simulate the real world, right? This means WFMs need to be tested for their ability to create realistic and physics-consistent videos. Evaluation of WFMs focuses on two aspects: 3D consistency and physics alignment.

  1. 3D Consistency shows how well the generated videos maintain realistic 3D structure and geometry. It includes geometric consistency, assessing adherence to 3D geometry using Sampson error and camera pose success rates, and view synthesis consistency, evaluating frame accuracy for novel viewpoints. So what about Cosmos WFMs’ capabilities? Cosmos WFMs outperform baseline models like VideoLDM in both geometric and view synthesis consistency. They achieve results close to real-world videos in terms of camera pose estimation and synthesized view quality.

[Image] Image credit: The original paper

  2. Physics Alignment measures how well the generated videos adhere to the laws of physics, like gravity and motion dynamics. It also tests whether the models can predict realistic outcomes based on observed scenarios. Cosmos WFMs show the following results:
    • Diffusion-based WFMs perform better in pixel-level metrics, rendering higher-quality visuals than autoregressive WFMs.
    • Larger models produce better visual details but don't necessarily adhere better to physics laws.
    • Cosmos WFMs have common issues, such as objects disappearing or deforming, and violations of physics like implausible movements or ignoring gravity.

[Image] Image credit: The original paper
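
The Sampson error used for geometric consistency has a compact definition. Given a fundamental matrix F relating two frames and a pair of matched pixels, it approximates how far the match is from satisfying the epipolar constraint. The sketch below implements one common (squared) form of the formula; the matrix and pixel coordinates are made-up values, and the exact variant used in the Cosmos evaluation may differ.

```python
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def sampson_error(F: Matrix3, x1: Sequence[float], x2: Sequence[float]) -> float:
    """Squared Sampson distance for pixel x1 in frame A and its match x2 in frame B."""
    p1 = (x1[0], x1[1], 1.0)  # homogeneous coordinates
    p2 = (x2[0], x2[1], 1.0)
    Fp1 = [sum(F[r][c] * p1[c] for c in range(3)) for r in range(3)]    # F @ p1
    Ftp2 = [sum(F[r][c] * p2[r] for r in range(3)) for c in range(3)]   # F.T @ p2
    residual = sum(p2[r] * Fp1[r] for r in range(3))                    # p2.T @ F @ p1
    denom = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return residual ** 2 / denom


# Fundamental matrix of a purely horizontal camera shift: matches must share a row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
print(sampson_error(F, (10.0, 20.0), (14.0, 20.0)))  # 0.0, a geometrically consistent match
print(sampson_error(F, (10.0, 20.0), (14.0, 22.0)))  # 2.0, the match is 2 pixels off its row
```

A generated video with consistent 3D geometry yields low errors across many such matches between its frames.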

Post-trained World Foundation Models in Physical AI applications

Now let's return to the Cosmos platform scheme, where the fourth stage is “Post-training WFMs Samples”. We have already mentioned that Cosmos WFMs can be fine-tuned to support various applications. Here are three examples tested by NVIDIA:

  1. Camera control: Fine-tuning Cosmos WFM allows camera control to create 3D navigable worlds from a single image, generating 3D-consistent and temporally coherent videos with realistic perspectives. Users can interactively explore the simulated world by moving the camera forward, backward, or rotating it left and right. AI agents can predict changes based on camera movements. Compared to CamCo, a state-of-the-art camera-controllable video generation model, Cosmos WFM offers superior video quality and camera trajectory re-estimation accuracy, generalizing well to new trajectories and data distributions. It also generates diverse outputs, simulating multiple possible futures from the same input.

  2. Robotic manipulation: Cosmos WFMs can be fine-tuned for robotic manipulation by predicting video outputs from instructions or actions, aiding in task planning and simulation.

    • In instruction-based video prediction, the input is a video frame and text instruction, and the output is a predicted video of the robot following the instruction.
    • In action-based next-frame prediction, a robot action vector replaces text instructions to generate the next video frame, showing the action's result. Processing action sequences allows the model to create a full video of the robot completing tasks.

Both Cosmos models outperformed the baseline, with the diffusion model achieving a 78.3% preference from human evaluators. Predicted video frames closely matched the ground truth, showcasing the models' precision.

[Image] Image credit: The original paper

  3. Autonomous driving: A fine-tuned Cosmos WFM can also generate realistic, consistent, and controlled multi-view driving scenarios. It generates videos from six camera views (front, rear, left, right, rear-left, rear-right), mimicking the camera setup of autonomous vehicles. The scenarios can vary in traffic density, weather, lighting, road type, and vehicle speed, can include scenes with rivers and tollbooths, and align with specified vehicle trajectories, supporting precise control of driving paths. The Cosmos models followed trajectories with less than a 7 cm deviation, demonstrating precise adherence to the input paths. They also maintained geometric and temporal coherence across multiple views.

[Image] Image credit: The original paper
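
To give a feel for what a deviation figure like 7 cm measures, here is a minimal sketch that compares a target trajectory with the path actually followed. The waypoints are made up and the metric (mean distance between matching waypoints) is an assumption for illustration; the paper's own procedure for estimating the followed trajectory from generated video is not reproduced here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in meters


def mean_deviation_cm(target: List[Point], followed: List[Point]) -> float:
    """Average distance between matching waypoints, converted to centimeters."""
    assert len(target) == len(followed), "trajectories need the same number of waypoints"
    distances = [math.dist(p, q) for p, q in zip(target, followed)]
    return 100.0 * sum(distances) / len(distances)


# Made-up waypoints: the followed path drifts sideways by 3, 4 and 5 cm.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
followed = [(0.0, 0.03), (1.0, 0.04), (2.0, 0.05)]
deviation = mean_deviation_cm(target, followed)
print(round(deviation, 2), deviation < 7.0)  # 4.0 True
```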

What about safety? The Guardrail system

As the Cosmos platform is designed for applications in diverse areas, developers and users should be confident that the platform is safe for any use case. To ensure the safe usage of Cosmos, a robust guardrail system is in place.

It works in two stages:

[Image] Image credit: The original paper

  • Pre-Guard prevents unsafe prompts using:
    • Keyword blocking: A blocklist filters prompts for harmful keywords (violence, profanity) using lemmatization to compare root forms of words, like "ran" → "run".
    • Aegis-AI-Content-Safety model: It flags and blocks prompts related to violence, threats, harassment, and similar risks, displaying an error message if deemed unsafe.
  • Post-Guard filters outputs to ensure safe video generation, using:
    • Video safety filter: A classifier reviews frames, flagging videos if any frame is unsafe.
    • Face blur filter: Detects and pixelates faces larger than 20x20 pixels, maintaining scene context while protecting privacy.
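
The rule-based parts of both stages can be sketched in plain Python. Everything named here is hypothetical: the tiny `LEMMAS` table stands in for a real lemmatizer, the blocklist has a single entry, the per-frame scores and the 0.5 threshold stand in for a trained classifier, and "larger than 20x20" is read as both sides exceeding 20 pixels.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lemma table and blocklist; a real system would use a proper lemmatizer.
LEMMAS: Dict[str, str] = {"ran": "run", "running": "run",
                          "attacked": "attack", "attacks": "attack"}
BLOCKLIST: Set[str] = {"attack"}


def prompt_is_blocked(prompt: str) -> bool:
    """Pre-Guard keyword check: compare the root form of each word with the blocklist."""
    words = (w.strip(".,!?\"'").lower() for w in prompt.split())
    return any(LEMMAS.get(w, w) in BLOCKLIST for w in words)


def video_is_unsafe(frame_scores: List[float], threshold: float = 0.5) -> bool:
    """Post-Guard rule: one unsafe frame is enough to flag the whole video."""
    return any(score >= threshold for score in frame_scores)


def faces_to_blur(face_boxes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only detected faces larger than 20x20 pixels; boxes are (width, height)."""
    return [(w, h) for w, h in face_boxes if w > 20 and h > 20]


print(prompt_is_blocked("A robot attacked the warehouse."))       # True
print(prompt_is_blocked("A robot stacks boxes in a warehouse."))  # False
print(video_is_unsafe([0.1, 0.05, 0.9]))                          # True
print(faces_to_blur([(12, 12), (64, 80)]))                        # [(64, 80)]
```

Comparing root forms is what lets a single blocklist entry catch "attacked" and "attacks" without also matching unrelated words such as "stacks".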

We have now reviewed all parts of the Cosmos WFM platform, highlighting their benefits and strengths. But what about the limitations?

Limitations

While promising, current WFMs are early-stage simulators with notable challenges, such as:

  • Issues with object permanence, contact dynamics, and physical accuracy, such as gravity or light interactions.
  • Generated videos can look realistic while still failing to adhere to fundamental physical principles.
  • Evaluation remains subjective, with human biases affecting physical fidelity assessments.

To overcome these issues, researchers aim to add automated evaluation with multi-modal LLMs and physical simulators for reproducibility and interactive testing.

Conclusion

With the Cosmos WFM platform, NVIDIA has once again offered a treasure trove of concepts that work together seamlessly as a unified mechanism. This shift towards creating general-purpose simulators for the physical world shows that we are rapidly moving towards Physical AI. Of course, the Cosmos WFM platform has its limitations and needs a lot of improvement – but it’s one of the best systematic approaches on the market, and it’s encouraging that they made it available for people to experiment with.

In any case, Cosmos is just the beginning of the journey. With more people empowered by such tools and approaching Physical AI from diverse angles, we might achieve breakthroughs faster than we expect.

Author: Alyona Vert Editor: Ksenia Se


