Updated May 17, 2026

SANA-WM

By NVIDIA

SANA-WM is NVIDIA Labs' 2.6 billion parameter open-source video world model — Apache 2.0, 720p, one-minute generation with 6 degrees of camera-pose control, designed as a baseline for embodied-AI and robotics research at consumer-GPU compute budgets.

Learning Objectives

  • Understand what a video world model is and why one-minute generation with explicit camera control matters for embodied AI
  • Identify where SANA-WM fits relative to closed video models (Sora, Veo, Runway) and earlier open releases
  • Evaluate when an open-source world model is the right tool for robotics, simulation, and research workflows

What Is SANA-WM?

SANA-WM is a 2.6 billion parameter open-source video world model released by NVIDIA Labs in May 2026. Unlike a general video-generation model that produces a single clip from a text prompt, SANA-WM is built as a controllable simulator: it generates 720p video up to one minute long while accepting six-degree-of-freedom (6-DoF) camera-pose input, covering three translational and three rotational axes, so it can render a scene along any specified camera trajectory, frame by frame.
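To make "6 degrees of freedom" concrete, here is a minimal sketch of one camera pose and its 4x4 camera-to-world matrix. The `CameraPose` class, the axis convention (y up, z forward), and the yaw-pitch-roll rotation order are illustrative assumptions; the lesson does not specify the pose format SANA-WM actually consumes.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Hypothetical 6-DoF pose: three translations plus three rotations (radians)."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # rotation about the y (up) axis
    pitch: float = 0.0  # rotation about the x (right) axis
    roll: float = 0.0   # rotation about the z (forward) axis

    def as_vector(self):
        """Flatten the pose into its six scalar degrees of freedom."""
        return [self.x, self.y, self.z, self.yaw, self.pitch, self.roll]


def _matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def camera_to_world(pose):
    """Build a 4x4 matrix [R | t; 0 0 0 1] with R = R_yaw @ R_pitch @ R_roll."""
    cy, sy = math.cos(pose.yaw), math.sin(pose.yaw)
    cp, sp = math.cos(pose.pitch), math.sin(pose.pitch)
    cr, sr = math.cos(pose.roll), math.sin(pose.roll)
    r_yaw = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    r_pitch = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    r_roll = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    rot = _matmul3(_matmul3(r_yaw, r_pitch), r_roll)
    t = [pose.x, pose.y, pose.z]
    return [rot[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# A camera one unit back from the origin, panned 90 degrees about the up axis.
pose = CameraPose(z=-1.0, yaw=math.pi / 2)
matrix = camera_to_world(pose)
```

A camera trajectory is then one such pose per generated frame.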

The model ships under the Apache 2.0 license alongside training code, model weights, and a project page at nvlabs.github.io/Sana/WM. The accompanying paper — SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer — is available on arXiv at 2605.15178. The team reports training on roughly 213,000 video clips in 15 days on 64 H100 GPUs, with a distilled variant that runs on a single consumer GPU.
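The reported training budget can be turned into GPU-hours with simple arithmetic. The sketch below assumes all 64 GPUs were busy for the full 15 days, which the lesson does not state explicitly, and the per-clip figure is only an average over the whole run, not a measured per-sample cost.

```python
def training_gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps num_gpus busy for the given number of days."""
    return num_gpus * days * 24


REPORTED_GPUS = 64        # H100 GPUs, as reported
REPORTED_DAYS = 15        # training duration, as reported
REPORTED_CLIPS = 213_000  # approximate clip count, as reported

gpu_hours = training_gpu_hours(REPORTED_GPUS, REPORTED_DAYS)
gpu_minutes_per_clip = gpu_hours * 60 / REPORTED_CLIPS

print(f"{gpu_hours:,.0f} H100-hours")                                # 23,040 H100-hours
print(f"{gpu_minutes_per_clip:.1f} H100-minutes per training clip")  # 6.5
```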

💡Key Concept

World model: A model that learns the dynamics of a visual environment well enough to roll the environment forward into the future under specified actions or camera poses. World models are foundational for embodied AI, robotics simulation, autonomous-vehicle training, and any system that needs to predict "what happens if I move this way." Closed examples include Google DeepMind's Genie; Meta's V-JEPA is another well-known line of world-model work. SANA-WM is among the first open-source releases to combine one-minute generation with explicit camera control.

Tip

Get started: Visit the project page at nvlabs.github.io/Sana/WM, read the arXiv paper at 2605.15178, and check the SANA GitHub repository for the released weights and inference code.

Architecture Highlights

SANA-WM's efficiency story rests on four design choices the paper details:

  • Hybrid linear attention — combines gated linear attention with selective softmax attention so the model handles long video sequences without quadratic compute blow-up
  • Dual-branch camera control — a dedicated trajectory-conditioning branch enforces precise 6-DoF camera-pose tracking frame-by-frame rather than relying on text descriptions alone
  • Two-stage generation with refinement — coarse generation followed by a refinement pass, which lets the model produce one-minute clips without quality collapse over time
  • Public-video annotation pipeline — the team built a tool that extracts camera poses from existing public video data, letting them train on roughly 213,000 clips without commissioning a custom motion-capture dataset

The combination keeps SANA-WM in the sweet spot where open-source video models have struggled: long enough to be useful for robotics rollouts, controllable enough to be useful for simulation, and cheap enough to run on a single consumer GPU.
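The reason linear attention avoids quadratic cost can be seen in its recurrent form: each new token updates a fixed-size state instead of attending over all earlier tokens. The following is a minimal sketch of a generic gated linear attention recurrence from the wider literature, using a single scalar gate per step as a simplifying assumption. It is not the paper's exact layer; the function name and the toy inputs are illustrative only.

```python
def gated_linear_attention(queries, keys, values, gates):
    """Causal gated linear attention in recurrent form.

    queries, keys: T vectors of length d_k; values: T vectors of length d_v;
    gates: T scalars in [0, 1] that decay the running state.

    State update:  S_t = g_t * S_{t-1} + outer(k_t, v_t)
    Output:        o_t = q_t @ S_t
    The state is d_k x d_v regardless of T, so memory does not grow with
    sequence length and total compute grows linearly in T.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


# Toy example with one-dimensional keys and values.
q = [[1.0], [1.0]]
k = [[1.0], [2.0]]
v = [[3.0], [4.0]]
no_decay = gated_linear_attention(q, k, v, gates=[1.0, 1.0])    # [[3.0], [11.0]]
with_decay = gated_linear_attention(q, k, v, gates=[1.0, 0.5])  # [[3.0], [9.5]]
```

In a hybrid design, layers like this would carry most of the long sequence while a smaller number of softmax attention layers handle the interactions that need exact pairwise comparison.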

Strengths

  • Apache 2.0 open-source release — full weights, training code, and paper; no usage restrictions or attribution requirements beyond standard Apache terms
  • One-minute video — most open video models cap out at a few seconds; SANA-WM holds quality across one-minute clips
  • 6-DoF camera control — explicit trajectory input makes the model usable as a controllable simulator, not just a clip generator
  • Consumer-GPU inference — the distilled variant runs on a single GPU, putting world-model research within reach of independent researchers and small labs
  • Strong NVIDIA Labs lineage — sits alongside the existing SANA (image), SANA-Video, SANA-1.5, and SANA-Sprint releases, with shared tooling and a maintained codebase

Limitations & Considerations

  • Research baseline, not production product — SANA-WM is a research artifact; documentation is still labeled "coming soon" on the main SANA repository as the team finalizes onboarding materials
  • 720p ceiling — competitive open release at this scale, but well below the 4K-class output some closed video models offer for cinematic workflows
  • Not a text-to-video product — camera-pose conditioning is the headline feature; users expecting Sora-style prompt-only generation should look at SANA-Video or closed alternatives
  • Compute footprint — the distilled variant runs on one consumer GPU for inference, but full training still requires 64 H100s for 15 days, well outside typical individual budgets

Best Use Cases

| Task | Why SANA-WM |
|---|---|
| Robotics simulation rollouts | Minute-scale video with camera control lets researchers preview robot trajectories before physical deployment |
| Embodied-AI research baseline | Open weights + Apache 2.0 license make SANA-WM a useful reference baseline for new world-model papers |
| Autonomous-vehicle scenario generation | Controllable camera trajectories generate edge-case driving scenes for perception-model training |
| Game / film previsualization | Plan camera moves through generated environments before committing to expensive render budgets |
| Academic research on long-horizon video | Hybrid linear attention design is itself a contribution worth studying for sequence-modeling work |

When to choose alternatives:

  • Pure text-to-video clip generation → OpenAI Sora, Google Veo, Runway Gen-4
  • Other state-of-the-art world models → Google DeepMind Genie (closed), Meta V-JEPA
  • Short-clip open-source generation → SANA-Video, ModelScope, Open-Sora

Getting Started

  1. Visit the SANA-WM project page and read the arXiv paper (2605.15178) to understand the architecture and training setup
  2. Clone the NVlabs/Sana GitHub repository and download the SANA-WM weights (check the repository for current availability)
  3. Start with the distilled variant on a single consumer GPU to confirm the inference pipeline works on your hardware before scaling up
  4. Experiment with the camera-pose conditioning: start with simple straight-line translations, then add rotation and combined moves
  5. For research work, pair SANA-WM with the broader SANA family (SANA-Video for short clips, SANA-1.5 for images) to build a complete generation stack
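For the camera-pose experiments in step 4, a trajectory can be prototyped as a list of per-frame poses before it is handed to any model. This is a minimal sketch that assumes a pose is a flat list `[x, y, z, yaw, pitch, roll]` with angles in radians; the input format SANA-WM actually expects is not described in this lesson, so `linear_trajectory` and the pose layout are hypothetical.

```python
import math


def linear_trajectory(num_frames, start, end):
    """Linearly interpolate 6-DoF poses from start to end, one pose per frame."""
    if num_frames < 2:
        raise ValueError("need at least two frames to define a trajectory")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        poses.append([s + t * (e - s) for s, e in zip(start, end)])
    return poses


origin = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# First experiment: pure translation, moving 2 units forward along z.
dolly = linear_trajectory(5, origin, [0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# Second experiment: the same move combined with a 30-degree yaw pan.
dolly_pan = linear_trajectory(5, origin, [0.0, 0.0, 2.0, math.radians(30), 0.0, 0.0])
```

Interpolating Euler angles directly is adequate for small single-axis rotations like this pan. For larger or multi-axis rotations, quaternion interpolation is the usual choice.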

⚠️Warning

Open release, conservative claims. NVIDIA Labs positions SANA-WM as a baseline for embodied-AI research at a fraction of closed-model compute budgets — not as a production-ready video tool. Quality holds across one-minute clips in the paper's reported settings, but real-world prompt diversity, novel scenes, and edge cases will surface limitations the paper does not cover. Treat early experiments as research, not deployment.

Key Takeaways

  • SANA-WM is NVIDIA Labs' 2.6 billion parameter open-source video world model — Apache 2.0, 720p, one-minute generation with explicit 6-DoF camera-pose control
  • The architecture combines hybrid linear attention, dual-branch camera control, and a two-stage refinement pipeline to keep one-minute generation tractable on consumer GPUs
  • Strongest fit for robotics simulation, embodied-AI research baselines, and controllable scenario generation — not a prompt-only video product
  • Released alongside the broader SANA family (SANA-Video, SANA-1.5, SANA-Sprint); shares tooling and codebase
  • Paper at arXiv 2605.15178; weights and code on GitHub under Apache 2.0
