Name: SANA-WM
Author: NVIDIA

Learning Objectives

Understand what a video world model is and why one-minute generation with explicit camera control matters for embodied AI
Identify where SANA-WM fits relative to closed video models (Sora, Veo, Runway) and earlier open releases
Evaluate when an open-source world model is the right tool for robotics, simulation, and research workflows

What Is SANA-WM?

SANA-WM is a 2.6-billion-parameter open-source video world model released by NVIDIA Labs in May 2026. Unlike a general video-generation model that produces a single clip from a text prompt, SANA-WM is built as a controllable simulator: it generates 720p video up to one minute long while accepting a six-degrees-of-freedom (6-DoF) camera pose for each frame, so the same scene can be rendered along any specified camera trajectory.
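To make "6 degrees of freedom" concrete: a camera pose has three translation components and three rotation components. The sketch below is a minimal illustration in plain Python, not SANA-WM's actual input format. The `CameraPose` class, the Euler-angle convention (yaw about y, pitch about x, roll about z, composed as R = Ry · Rx · Rz), and the 4x4 layout are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


@dataclass
class CameraPose:
    """Hypothetical 6-DoF pose: 3 translation values + 3 rotation values."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # rotation about the y axis, in radians
    pitch: float = 0.0  # rotation about the x axis, in radians
    roll: float = 0.0   # rotation about the z axis, in radians

    def rotation(self):
        """3x3 rotation matrix R = Ry(yaw) @ Rx(pitch) @ Rz(roll)."""
        cy, sy = math.cos(self.yaw), math.sin(self.yaw)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
        rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
        rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
        return matmul3(matmul3(ry, rx), rz)

    def matrix(self):
        """4x4 homogeneous transform with R in the top-left and t in the last column."""
        r = self.rotation()
        t = [self.x, self.y, self.z]
        return [r[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

Six numbers per frame are enough to place and orient the camera; a trajectory is then just one such pose per generated frame.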

The model ships under the Apache 2.0 license alongside training code, model weights, and a project page at nvlabs.github.io/Sana/WM. The accompanying paper — SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer — is available on arXiv at 2605.15178. The team reports training on roughly 213,000 video clips in 15 days on 64 H100 GPUs, with a distilled variant that runs on a single consumer GPU.

💡Key Concept

World model: A model that learns the dynamics of a visual environment well enough to roll the environment forward into the future under specified actions or camera poses. World models are foundational for embodied AI, robotics simulation, autonomous-vehicle training, and any system that needs to predict "what happens if I move this way." Examples from large labs include Google DeepMind's Genie, which is closed, and Meta's V-JEPA; SANA-WM is among the first open-source releases to reach one-minute clip length with explicit camera control.
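The "roll forward under actions" idea can be sketched as a plain loop: a step function maps the current state and an action to the next state, and each prediction is fed back in as the next input. The toy below only illustrates that interface. `toy_step` is a hypothetical stand-in for a learned model (here the state is a 2D position and the action is a displacement), and nothing in it reflects SANA-WM's actual API.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def rollout(step: Callable[[State, Action], State],
            initial: State,
            actions: Sequence[Action]) -> List[State]:
    """Apply `step` once per action, feeding each predicted state back in."""
    states = [initial]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def toy_step(state: State, action: Action) -> State:
    """Hypothetical dynamics: the action is a displacement added to the state."""
    return (state[0] + action[0], state[1] + action[1])


trajectory = rollout(toy_step, (0.0, 0.0), [(1.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
print(trajectory)
```

In a video world model the state would be frames (or their latents) and the action would be a camera pose or control signal, but the feedback structure is the same.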

✅Tip

Get started: Visit the project page at nvlabs.github.io/Sana/WM, read the arXiv paper at 2605.15178, and check the SANA GitHub repository for the released weights and inference code.

Architecture Highlights

SANA-WM's efficiency story rests on four design choices the paper details:

Hybrid linear attention — combines gated linear attention with selective softmax attention so the model handles long video sequences without quadratic compute blow-up
Dual-branch camera control — a dedicated trajectory-conditioning branch enforces precise 6-DoF camera-pose tracking frame-by-frame rather than relying on text descriptions alone
Two-stage generation with refinement — coarse generation followed by a refinement pass, which lets the model produce one-minute clips without quality collapse over time
Public-video annotation pipeline — the team built a tool that extracts camera poses from existing public video data, letting them train on roughly 213,000 clips without commissioning a custom motion-capture dataset

Together, these choices target three areas where open-source video models have struggled: clips long enough to be useful for robotics rollouts, camera control precise enough for simulation, and, in the distilled variant, inference cheap enough to run on a single consumer GPU.
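To see why linear attention avoids the quadratic cost, the sketch below compares two ways of computing the same kernelized attention output in plain Python. It is a generic, non-causal linear-attention illustration, not the gated or hybrid scheme the paper describes. The `elu(x) + 1` feature map is a common choice in the linear-attention literature and is an assumption here, as are all function names.

```python
import math


def feature_map(vec):
    """Positive feature map elu(x) + 1 (assumed; a common linear-attention choice)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def quadratic_attention(q, k, v):
    """Reference form: builds every query-key weight, O(N^2) in sequence length."""
    fq = [feature_map(row) for row in q]
    fk = [feature_map(row) for row in k]
    out = []
    for qi in fq:
        weights = [sum(a * b for a, b in zip(qi, kj)) for kj in fk]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def linear_attention(q, k, v):
    """Same result, but keys and values are summarized once, O(N) in sequence length."""
    fq = [feature_map(row) for row in q]
    fk = [feature_map(row) for row in k]
    d, dv, n = len(fk[0]), len(v[0]), len(fk)
    # kv[a][c] = sum over positions j of fk[j][a] * v[j][c]
    kv = [[sum(fk[j][a] * v[j][c] for j in range(n)) for c in range(dv)]
          for a in range(d)]
    # k_sum[a] = sum over positions j of fk[j][a], used for normalization
    k_sum = [sum(fk[j][a] for j in range(n)) for a in range(d)]
    out = []
    for qi in fq:
        norm = sum(qi[a] * k_sum[a] for a in range(d))
        out.append([sum(qi[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out
```

The two functions agree up to floating-point error because matrix multiplication is associative: the key-value summary `kv` has a fixed size that does not grow with sequence length, which is what makes minute-long sequences affordable.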

Strengths

Apache 2.0 open-source release — full weights, training code, and paper; no usage restrictions or attribution requirements beyond standard Apache terms
One-minute video — most open video models cap out at a few seconds; SANA-WM holds quality across one-minute clips
6-DoF camera control — explicit trajectory input makes the model usable as a controllable simulator, not just a clip generator
Consumer-GPU inference — the distilled variant runs on a single GPU, putting world-model research within reach of independent researchers and small labs
Strong NVIDIA Labs lineage — sits alongside the existing SANA (image), SANA-Video, SANA-1.5, and SANA-Sprint releases, with shared tooling and a maintained codebase

Limitations & Considerations

Research baseline, not production product — SANA-WM is a research artifact; documentation is still labeled "coming soon" on the main SANA repository as the team finalizes onboarding materials
720p ceiling — competitive open release at this scale, but well below the 4K-class output some closed video models offer for cinematic workflows
Not a text-to-video product — camera-pose conditioning is the headline feature; users expecting Sora-style prompt-only generation should look at SANA-Video or closed alternatives
Compute footprint — the distilled variant runs on one consumer GPU for inference, but full training still requires 64 H100s for 15 days, well outside typical individual budgets
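As a rough sense of scale, the reported training run can be converted to GPU-hours. The snippet assumes all 64 GPUs were busy for the full 15 days, which the source does not state explicitly. The `estimated_cost` helper is hypothetical, and the hourly rate is left for the reader to supply because no price is given in the source.

```python
GPUS = 64            # H100 count reported for the full training run
DAYS = 15            # reported training duration
HOURS_PER_DAY = 24

# Assumes every GPU ran for the whole period.
gpu_hours = GPUS * DAYS * HOURS_PER_DAY
print(gpu_hours)  # 23040


def estimated_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Scale GPU-hours by a caller-supplied hourly rate (no default on purpose)."""
    return gpu_hours * usd_per_gpu_hour
```

Roughly 23,000 H100-hours is modest next to large closed-model budgets but still far beyond what an individual can typically rent, which is why the distilled inference path matters.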

Best Use Cases

Task	Why SANA-WM
Robotics simulation rollouts	Minute-scale video with camera control lets researchers preview robot trajectories before physical deployment
Embodied-AI research baseline	Open weights + Apache 2.0 license make SANA-WM a useful reference baseline for new world-model papers
Autonomous-vehicle scenario generation	Controllable camera trajectories generate edge-case driving scenes for perception-model training
Game / film previsualization	Plan camera moves through generated environments before committing to expensive render budgets
Academic research on long-horizon video	Hybrid linear attention design is itself a contribution worth studying for sequence-modeling work

When to choose alternatives:

Pure text-to-video clip generation → OpenAI Sora, Google Veo, Runway Gen-4
Other state-of-the-art world models → Google DeepMind Genie (closed), Meta V-JEPA
Short-clip open-source generation → SANA-Video, ModelScope, Open-Sora

Getting Started

Visit the SANA-WM project page and read the arXiv paper (2605.15178) to understand the architecture and training setup
Clone the NVlabs/Sana GitHub repository and download the released SANA-WM weights
Start with the distilled variant on a single consumer GPU to confirm the inference pipeline works on your hardware before scaling up
Experiment with the camera-pose conditioning: pass simple straight-line translations first, then add rotation and combined moves
For research work, pair SANA-WM with the broader SANA family (SANA-Video for short clips, SANA-1.5 for images) to build a complete generation stack
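For the camera-pose experiments suggested in the steps above, a simple way to build test trajectories is to interpolate between a start pose and an end pose. The sketch below is an assumption-heavy illustration: the `(x, y, z, yaw, pitch, roll)` tuple layout and the `interpolate_poses` helper are hypothetical, and the format SANA-WM's inference code actually expects should be taken from the repository.

```python
import math


def interpolate_poses(start, end, num_frames):
    """Linearly interpolate each of the six pose components, one pose per frame.

    Poses are (x, y, z, yaw, pitch, roll) tuples; angles are in radians.
    Interpolating Euler angles directly is only reasonable for simple moves.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to define a trajectory")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        poses.append(tuple(s + (e - s) * t for s, e in zip(start, end)))
    return poses


origin = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)

# Step 1: straight-line dolly, 2 units forward along z, no rotation.
dolly = interpolate_poses(origin, (0.0, 0.0, 2.0, 0.0, 0.0, 0.0), 5)

# Step 2: same dolly combined with a 90-degree yaw pan.
dolly_pan = interpolate_poses(origin, (0.0, 0.0, 2.0, math.pi / 2, 0.0, 0.0), 5)
```

Starting with translation-only paths makes it easier to tell whether a visual artifact comes from the model or from a mistake in the pose convention.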

⚠️Warning

Open release, conservative claims. NVIDIA Labs positions SANA-WM as a baseline for embodied-AI research at a fraction of closed-model compute budgets — not as a production-ready video tool. Quality holds across one-minute clips in the paper's reported settings, but real-world prompt diversity, novel scenes, and edge cases will surface limitations the paper does not cover. Treat early experiments as research, not deployment.

Key Takeaways

SANA-WM is NVIDIA Labs' 2.6 billion parameter open-source video world model — Apache 2.0, 720p, one-minute generation with explicit 6-DoF camera-pose control
The architecture combines hybrid linear attention, dual-branch camera control, and a two-stage refinement pipeline to keep one-minute generation tractable on consumer GPUs
Strongest fit for robotics simulation, embodied-AI research baselines, and controllable scenario generation — not a prompt-only video product
Released alongside the broader SANA family (SANA-Video, SANA-1.5, SANA-Sprint); shares tooling and codebase
Paper at arXiv 2605.15178; weights and code on GitHub under Apache 2.0
