Learning Objectives
- Understand ZAYA1-8B's mixture-of-experts (MoE) architecture and why a 760 million active-parameter footprint matters for inference cost
- Identify what's significant about training a frontier-quality math model entirely on AMD Instinct MI300X — the first such public claim
- Evaluate when ZAYA1-8B is preferable to comparable open-weight reasoning models (DeepSeek-R1, Qwen reasoning variants)
What Is ZAYA1-8B?
ZAYA1-8B is an open-weight mixture-of-experts language model released by Zyphra, a San Francisco AI lab. The model has 8.4 billion total parameters but activates only 760 million parameters per inference token — roughly an order of magnitude smaller active footprint than its dense-model peers. On math benchmarks, ZAYA1-8B matches DeepSeek-R1, scoring 89.1 on AIME 2026 and 71.6 on HMMT, and stays competitive with Claude Sonnet 4.5 on broader reasoning despite the much smaller compute budget per token.
What sets ZAYA1-8B apart from the rest of the open-weight reasoning landscape is the training stack. The model was trained on a 1,024-node AMD Instinct MI300X cluster built with IBM, using AMD Pensando Pollara networking — making it the first frontier-quality math model trained entirely outside the NVIDIA CUDA stack. Combined with a custom mixture-of-experts design and a specialized attention mechanism that preserves reasoning quality at lower active-parameter budgets, the AMD-only training claim is a meaningful signal that hyperscaler-grade training no longer requires NVIDIA hardware.
The weights are released under Apache 2.0 on Hugging Face — the most permissive open-model license, with no restrictions on commercial use or competitive applications. Local deployment requires Zyphra's vLLM fork rather than standard vLLM, so plan for the runtime swap if you're integrating the model into existing inference pipelines.
✅Tip
Get ZAYA1-8B: huggingface.co/Zyphra — Apache 2.0 license, weights downloadable; serverless inference via Zyphra Cloud
Pricing and Access
| Access Method | Cost | Best For |
|---|---|---|
| Hugging Face Download | Free (Apache 2.0) | Development, research, custom deployment with Zyphra's vLLM fork |
| Zyphra Cloud (serverless) | Usage-based | Quick API access without managing GPU infrastructure |
| Self-hosted | Free + your compute | Production deployment with control over inference stack |
ZAYA1-8B is freely downloadable under Apache 2.0 — the most permissive open-source license, with no restrictions on commercial or competitive use. This is more permissive than the Gemma license (which restricts training competing AI services).
Core Capabilities
Frontier-Quality Math at 760 Million Active Parameters
Mathematical reasoning is the headline strength. On AIME 2026 ZAYA1-8B scores 89.1, matching DeepSeek-R1 — a much larger model. On HMMT the score is 71.6, also matching DeepSeek-R1. The achievement is doubly notable because most models in the 8 billion total / 760 million active range struggle on competition-grade math benchmarks; the mixture-of-experts routing concentrates reasoning capacity in the active subset of weights without spreading it thin across the full parameter pool.
For comparison, Zyphra reports ZAYA1-8B stays competitive with Claude Sonnet 4.5 on broader reasoning — meaningfully short of the frontier closed-API tier overall, but a strong showing for an order-of-magnitude smaller active-parameter budget.
Mixture-of-Experts Architecture
ZAYA1-8B uses a custom mixture-of-experts (MoE) design where most of the 8.4 billion total parameters sit dormant for any given inference token; the model routes each token to a small subset of "experts," activating only 760 million parameters per token. The architecture also includes a specialized attention mechanism designed to preserve reasoning quality at lower active-parameter budgets — Zyphra's published research describes this as an explicit trade against the standard practice of scaling dense models.
Training Stack — AMD MI300X Only, No NVIDIA
The most strategically significant claim. ZAYA1-8B was trained on a 1,024-node cluster of AMD Instinct MI300X GPUs, built with IBM, using AMD Pensando Pollara for cluster networking. The MI300X has 192 GB HBM3 per GPU — the largest unified-memory GPU in the AMD Instinct line — making it well-suited to the high-memory-pressure phases of MoE training.
The wider implication: AMD's Instinct line, paired with Pensando networking and a software stack that does not depend on CUDA, is now demonstrably capable of training frontier-quality reasoning models. This sits alongside Cerebras and Groq as evidence that the non-NVIDIA frontier-AI training market is maturing.
Local Deployment via Zyphra's vLLM Fork
For self-hosted deployment, ZAYA1-8B requires Zyphra's fork of vLLM rather than upstream vLLM — the MoE routing kernel and the custom attention mechanism need patches that have not yet landed in mainline vLLM. The fork is open-source on Zyphra's GitHub. Plan for this dependency if you are slotting ZAYA1-8B into an existing inference platform; serverless deployment via Zyphra Cloud avoids the runtime switch entirely.
Strengths
- Frontier-quality math at small active size: Matches DeepSeek-R1 on AIME 2026 (89.1) and HMMT (71.6) with only 760 million active parameters per token
- AMD-only training: First public claim of a frontier-quality math model trained entirely on AMD Instinct MI300X with Pensando Pollara networking — no NVIDIA dependency
- Apache 2.0 license: Most permissive open-model license; commercial and competitive use unrestricted
- Order-of-magnitude smaller active footprint: Inference cost scales with active parameters, so ZAYA1-8B runs much cheaper than dense models in the same quality tier
- Available immediately: Weights on Hugging Face, serverless deployment via Zyphra Cloud, no waitlist
Limitations & Considerations
- Custom runtime requirement: Self-hosted deployment requires Zyphra's vLLM fork, not upstream vLLM
- Specialized strength profile: Math and reasoning lead; less strong on creative writing, broad multilingual coverage, and very long context
- Smaller than frontier closed models on hardest tasks: Competitive with Claude Sonnet 4.5 on broader reasoning but not Claude Opus 4.7 or GPT-5.5 on the hardest tasks
- New lab: Zyphra is a smaller AI startup; long-term operational and support track record is still being established
- AMD-only training does not equal AMD-only inference: ZAYA1-8B inference runs on standard NVIDIA, AMD, or CPU hardware; the AMD-specific story is about the training pipeline only
Best Use Cases
| Task | Why ZAYA1-8B |
|---|---|
| Math and competition-grade reasoning | AIME and HMMT scores match DeepSeek-R1 at much smaller active-parameter footprint |
| Cost-sensitive reasoning deployments | 760 million active parameters per token keeps inference cost low at production scale |
| Open-source research and competitive products | Apache 2.0 license has no restrictions on competitive AI training or commercial deployment |
| AMD-stack deployments | The AMD Instinct training pedigree signals strong compatibility with non-NVIDIA inference stacks too |
| Self-hosted privacy-sensitive workflows | Open weights plus permissive license allow on-premise deployment without data egress |
When to choose alternatives:
- Maximum reasoning ceiling → Claude Opus 4.7, GPT-5.5, or DeepSeek V4-Pro for the hardest reasoning tasks
- Multilingual breadth → Qwen3.5 (over 100 languages) or Gemma 4 (over 35 languages) for multilingual workloads
- Standard inference runtime → Models that run on stock vLLM (Llama, Qwen, Gemma) for plug-and-play deployment
- Smallest footprint at all costs → Phi-4 Mini or Gemma 4 E2B for the absolute smallest edge devices
Getting Started
- Download weights from huggingface.co/Zyphra under Apache 2.0
- Clone Zyphra's vLLM fork from GitHub for local deployment, or use Zyphra Cloud for serverless API access
- Run a math-benchmark probe (a small subset of AIME or HMMT problems) before committing to broader integration — the model's strength profile is reasoning-heavy
- For self-hosted production: budget for the runtime swap to Zyphra's vLLM fork; plan a rollback path if upstream-vLLM features you depend on are not yet patched in
- For competitive workloads where Apache 2.0 matters, ZAYA1-8B is a stronger license fit than Gemma 4 (which restricts competing-service training)
✅Tip
Why the AMD training story matters: Most open-weight reasoning models published before May 2026 disclosed NVIDIA training infrastructure. ZAYA1-8B's AMD-only stack is a credible alternative path — relevant for organizations that want to model their own training capacity on AMD silicon (especially as MI400 series production scales).
Key Takeaways
- ZAYA1-8B is Zyphra's open-weight mixture-of-experts model with 8.4 billion total parameters and only 760 million active per token — frontier-quality math at much smaller active-parameter footprint
- Matches DeepSeek-R1 on AIME 2026 (89.1) and HMMT (71.6); competitive with Claude Sonnet 4.5 on broader reasoning
- Trained entirely on AMD Instinct MI300X GPUs with AMD Pensando Pollara networking — the first public claim of a frontier-quality math model trained outside the NVIDIA CUDA stack
- Apache 2.0 license — no restrictions on commercial or competitive use, more permissive than the Gemma license
- Self-hosted deployment requires Zyphra's vLLM fork; serverless deployment via Zyphra Cloud avoids the runtime switch