Learning Objectives
- Understand what a video world model is and why one-minute generation with explicit camera control matters for embodied AI
- Identify where SANA-WM fits relative to closed video models (Sora, Veo, Runway) and earlier open releases
- Evaluate when an open-source world model is the right tool for robotics, simulation, and research workflows
What Is SANA-WM?
SANA-WM is a 2.6-billion-parameter open-source video world model released by NVIDIA Labs in May 2026. Unlike a general video-generation model that produces a single clip from a text prompt, SANA-WM is built as a controllable simulator: it generates 720p video up to one minute long while accepting six-degree-of-freedom (6-DoF) camera-pose input (three translation axes plus three rotation axes), so it can render a scene along a specified camera trajectory, frame by frame.
The model ships under the Apache 2.0 license alongside training code, model weights, and a project page at nvlabs.github.io/Sana/WM. The accompanying paper — SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer — is available on arXiv at 2605.15178. The team reports training on roughly 213,000 video clips in 15 days on 64 H100 GPUs, with a distilled variant that runs on a single consumer GPU.
💡Key Concept
World model: A model that learns the dynamics of a visual environment well enough to roll the environment forward into the future under specified actions or camera poses. World models are foundational for embodied AI, robotics simulation, autonomous-vehicle training, and any system that needs to predict "what happens if I move this way." Other examples include Google DeepMind's closed Genie models and Meta's V-JEPA; SANA-WM is among the first open-source releases to generate minute-long video with explicit camera control.
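To make the definition concrete, here is a minimal sketch of a rollout loop. The `toy_dynamics` function is a hypothetical stand-in (simple point-mass physics) and has nothing to do with SANA-WM's internals; a real world model would replace it with a learned network that predicts video frames from camera poses or actions.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # toy "scene state": (position, velocity)
Action = float                # toy action: acceleration applied for one step


def toy_dynamics(state: State, action: Action, dt: float = 0.1) -> State:
    """Hypothetical stand-in for a learned model: point-mass physics."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def rollout(step: Callable[[State, Action], State],
            initial: State,
            actions: Sequence[Action]) -> List[State]:
    """Roll the environment forward, feeding each prediction back in as input."""
    states = [initial]
    for action in actions:
        states.append(step(states[-1], action))
    return states


# Five steps of constant acceleration, starting at rest.
trajectory = rollout(toy_dynamics, (0.0, 0.0), [1.0] * 5)
for t, (pos, vel) in enumerate(trajectory):
    print(f"t={t}: position={pos:.3f}, velocity={vel:.3f}")
```

The key property is that each predicted state becomes the input to the next step, which is why errors can compound over long horizons and why minute-scale generation is hard.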
✅Tip
Get started: Visit the project page at nvlabs.github.io/Sana/WM, read the arXiv paper at 2605.15178, and check the SANA GitHub repository for the released weights and inference code.
Architecture Highlights
SANA-WM's efficiency story rests on four design choices the paper details:
- Hybrid linear attention — combines gated linear attention with selective softmax attention so the model handles long video sequences without quadratic compute blow-up
- Dual-branch camera control — a dedicated trajectory-conditioning branch enforces precise 6-DoF camera-pose tracking frame-by-frame rather than relying on text descriptions alone
- Two-stage generation with refinement — coarse generation followed by a refinement pass, which lets the model produce one-minute clips without quality collapse over time
- Public-video annotation pipeline — the team built a tool that extracts camera poses from existing public video data, letting them train on roughly 213,000 clips without commissioning a custom motion-capture dataset
The combination keeps SANA-WM in the sweet spot where open-source video models have struggled: long enough to be useful for robotics rollouts, controllable enough to be useful for simulation, and cheap enough to run on a single consumer GPU.
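The following sketch illustrates why gated linear attention scales to long sequences. It is a simplified, generic formulation (one scalar gate per step, no normalization, no feature map), not SANA-WM's actual layer; all function names are hypothetical. The recurrent form keeps a fixed-size state and touches each frame once, while the pairwise reference compares every frame with every earlier frame. Both produce the same output.

```python
import random
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def gated_linear_attention(q, k, v, gates):
    """Causal pass with a fixed-size state: cost grows linearly with length."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # decayed running sum of k v^T
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old state by the gate, then add the new outer product.
                state[i][j] = g_t * state[i][j] + k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def quadratic_reference(q, k, v, gates):
    """Same result computed pairwise: cost grows quadratically with length."""
    outputs = []
    for t in range(len(q)):
        out = [0.0] * len(v[0])
        decay = 1.0  # product of gates between step s and step t
        for s in range(t, -1, -1):
            weight = decay * dot(q[t], k[s])
            for j in range(len(out)):
                out[j] += weight * v[s][j]
            decay *= gates[s]
        outputs.append(out)
    return outputs


rng = random.Random(0)
n, d = 6, 3  # toy sizes: 6 steps, 3-dimensional queries, keys, and values
q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
k = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
v = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
gates = [rng.uniform(0.5, 1.0) for _ in range(n)]

fast = gated_linear_attention(q, k, v, gates)
slow = quadratic_reference(q, k, v, gates)
max_gap = max(abs(a - b) for fa, sa in zip(fast, slow) for a, b in zip(fa, sa))
print(f"largest difference between the two forms: {max_gap:.2e}")
```

In a hybrid design of the kind the paper describes, layers like this would carry most of the sequence while a smaller number of softmax attention layers handle interactions that the compressed state represents poorly.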
Strengths
- Apache 2.0 open-source release — full weights, training code, and paper; no usage restrictions or attribution requirements beyond standard Apache terms
- One-minute video — most open video models cap out at a few seconds; SANA-WM holds quality across one-minute clips
- 6-DoF camera control — explicit trajectory input makes the model usable as a controllable simulator, not just a clip generator
- Consumer-GPU inference — the distilled variant runs on a single GPU, putting world-model research within reach of independent researchers and small labs
- Strong NVIDIA Labs lineage — sits alongside the existing SANA (image), SANA-Video, SANA-1.5, and SANA-Sprint releases, with shared tooling and a maintained codebase
Limitations & Considerations
- Research baseline, not production product — SANA-WM is a research artifact; documentation is still labeled "coming soon" on the main SANA repository as the team finalizes onboarding materials
- 720p ceiling — competitive open release at this scale, but well below the 4K-class output some closed video models offer for cinematic workflows
- Not a text-to-video product — camera-pose conditioning is the headline feature; users expecting Sora-style prompt-only generation should look at SANA-Video or closed alternatives
- Compute footprint — the distilled variant runs on one consumer GPU for inference, but full training still requires 64 H100s for 15 days, well outside typical individual budgets
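For a sense of scale, the reported training run can be converted to GPU-hours with simple arithmetic. The sketch below uses only the two figures quoted above; the single-GPU equivalent is a rough illustration that assumes perfectly linear scaling, which real training does not achieve.

```python
# Figures reported in the text above.
num_gpus = 64
training_days = 15

gpu_days = num_gpus * training_days
gpu_hours = gpu_days * 24
single_gpu_years = gpu_days / 365  # idealized: assumes perfect linear scaling

print(f"{gpu_days} GPU-days = {gpu_hours:,} GPU-hours")
print(f"Roughly {single_gpu_years:.1f} years on a single H100")
```

That works out to 23,040 H100 GPU-hours for the reported run.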
Best Use Cases
| Task | Why SANA-WM |
|---|---|
| Robotics simulation rollouts | Minute-scale video with camera control lets researchers preview robot trajectories before physical deployment |
| Embodied-AI research baseline | Open weights + Apache 2.0 license make SANA-WM a useful reference baseline for new world-model papers |
| Autonomous-vehicle scenario generation | Controllable camera trajectories generate edge-case driving scenes for perception-model training |
| Game / film previsualization | Plan camera moves through generated environments before committing to expensive render budgets |
| Academic research on long-horizon video | Hybrid linear attention design is itself a contribution worth studying for sequence-modeling work |
When to choose alternatives:
- Pure text-to-video clip generation → OpenAI Sora, Google Veo, Runway Gen-4
- Other state-of-the-art world models → Google DeepMind Genie (closed), Meta V-JEPA
- Short-clip open-source generation → SANA-Video, ModelScope, Open-Sora
Getting Started
- Visit the SANA-WM project page and read the arXiv paper (2605.15178) to understand the architecture and training setup
- Clone the NVlabs/Sana GitHub repository and download the SANA-WM weights, checking the repository for current availability
- Start with the distilled variant on a single consumer GPU to confirm the inference pipeline works on your hardware before scaling up
- Experiment with the camera-pose conditioning — pass simple linear trajectories first, then add rotation and translation
- For research work, pair SANA-WM with the broader SANA family (SANA-Video for short clips, SANA-1.5 for images) to build a complete generation stack
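The trajectory experiments suggested above can be prepared independently of the model. The sketch below builds simple 6-DoF camera paths as lists of 4x4 camera-to-world matrices. Everything here is an assumption for illustration: the helper names are hypothetical, and the matrix convention, axis orientation (z treated as forward), units, and frame count that SANA-WM's inference code actually expects are not stated in this article, so check the repository before feeding poses to the model.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rot_x(deg: float) -> Matrix:
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(deg: float) -> Matrix:
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(deg: float) -> Matrix:
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][m] * b[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]


def pose_matrix(x, y, z, yaw_deg=0.0, pitch_deg=0.0, roll_deg=0.0) -> Matrix:
    """4x4 camera-to-world matrix; rotation order yaw (Y), pitch (X), roll (Z)."""
    r = matmul3(rot_y(yaw_deg), matmul3(rot_x(pitch_deg), rot_z(roll_deg)))
    t = (x, y, z)
    return [r[i] + [float(t[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def linear_trajectory(start: Sequence[float], end: Sequence[float],
                      num_frames: int) -> List[Matrix]:
    """Interpolate (x, y, z, yaw, pitch, roll) linearly between two poses."""
    poses = []
    for i in range(num_frames):
        alpha = i / (num_frames - 1) if num_frames > 1 else 0.0
        params = [a + alpha * (b - a) for a, b in zip(start, end)]
        poses.append(pose_matrix(*params))
    return poses


# Step 1: pure translation, moving 2 units forward along z.
dolly = linear_trajectory((0, 0, 0, 0, 0, 0), (0, 0, 2, 0, 0, 0), num_frames=5)

# Step 2: the same move with a 30-degree yaw added.
pan = linear_trajectory((0, 0, 0, 0, 0, 0), (0, 0, 2, 30, 0, 0), num_frames=5)

for label, path in (("dolly", dolly), ("dolly + pan", pan)):
    last = path[-1]
    print(label, "final translation:", [round(row[3], 3) for row in last[:3]])
```

Linear interpolation of Euler angles is adequate for small, simple moves like these; for larger rotations, quaternion interpolation avoids uneven angular speed and gimbal problems.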
⚠️Warning
Open release, conservative claims. NVIDIA Labs positions SANA-WM as a baseline for embodied-AI research at a fraction of closed-model compute budgets — not as a production-ready video tool. Quality holds across one-minute clips in the paper's reported settings, but real-world prompt diversity, novel scenes, and edge cases will surface limitations the paper does not cover. Treat early experiments as research, not deployment.
Key Takeaways
- SANA-WM is NVIDIA Labs' 2.6 billion parameter open-source video world model — Apache 2.0, 720p, one-minute generation with explicit 6-DoF camera-pose control
- The architecture combines hybrid linear attention, dual-branch camera control, and a two-stage refinement pipeline to keep one-minute generation tractable, with a distilled variant that runs on a single consumer GPU
- Strongest fit for robotics simulation, embodied-AI research baselines, and controllable scenario generation — not a prompt-only video product
- Released alongside the broader SANA family (SANA-Video, SANA-1.5, SANA-Sprint); shares tooling and codebase
- Paper at arXiv 2605.15178; weights and code on GitHub under Apache 2.0