Learning Objectives
- Understand how Step 3.7 Flash fits in the open-weights tier alongside Liquid LFM, Kimi K2.6, and DeepSeek
- Identify the mixture-of-experts architecture choice and what it means for inference cost and throughput
- Evaluate when Step 3.7 Flash is the right choice for agentic coding, search, and vision-language workflows
What Is Step 3.7 Flash?
Step 3.7 Flash is the headline open-weights release from StepFun (also written Stepfun), a Shanghai-based AI lab founded in 2023. It is a mixture-of-experts vision-language model engineered for agentic deployment — coding agents, search workflows, and multi-step tool use — rather than chasing raw frontier capability against the largest US labs. Released in late May 2026 under the Apache 2.0 license, the model ships with day-one hosted inference on the StepFun Open Platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal next in line.
The naming convention matters: Flash in StepFun's product line denotes the high-efficiency variant tuned for throughput and cost rather than the absolute capability ceiling. Step 3.7 Flash succeeds Step 3.5 Flash and sits in the same competitive band as Kimi K2.6, DeepSeek V4 Flash, and Liquid LFM2.5 — Chinese-or-MIT open-weights labs competing on the practical agentic-deployment dimension rather than the closed-source frontier.
✅Tip
Try Step 3.7 Flash: platform.stepfun.com — StepFun Open Platform with hosted inference; also available on OpenRouter and NVIDIA NIM. Open weights downloadable from the stepfun-ai Hugging Face organization.
Architecture and Specifications
The current flagship is Step 3.7 Flash, released open-weight in late May 2026. The architecture pairs a 196 billion parameter language backbone with a 1.8 billion parameter Vision Transformer (ViT), making the model natively multimodal from the first token.
| Specification | Step 3.7 Flash |
|---|---|
| Total parameters | 198 billion (196 billion language plus 1.8 billion vision) |
| Active parameters per token | Roughly 11 billion via mixture-of-experts routing |
| Context window | 256,000 tokens |
| Throughput | Up to 400 tokens per second on hosted inference |
| License | Apache 2.0 — no use restrictions |
| Checkpoint formats | BF16, FP8, NVFP4, and GGUF |
| Reasoning modes | Three selectable levels (low, medium, high) |
The 198 billion total parameter budget with roughly 11 billion active per token puts Step 3.7 Flash in the same architectural class as DeepSeek V4 and Kimi K2.6 — a sparse mixture-of-experts design that delivers the knowledge capacity of a much larger dense model at the inference cost of a smaller one. Three selectable reasoning levels let the same checkpoint cover quick lookups, intermediate planning, and deep multi-step tool-use workflows without swapping models.
💡Key Concept
Why mixture-of-experts for agentic workloads. An agentic loop alternates between cheap routing decisions (which tool to call next, which sub-goal to pursue) and expensive reasoning calls (compose the response, debug the failure). A sparse mixture-of-experts model can spend the cheap-call budget at low active-parameter cost and reserve the expensive-call budget for the few moments where it matters. The architecture is what makes Step 3.7 Flash's Advisor Mode pricing competitive with much smaller dense alternatives without sacrificing the capability ceiling on the hard turns.
Benchmark Performance
StepFun reports the following benchmark wins for Step 3.7 Flash:
| Benchmark | Step 3.7 Flash |
|---|---|
| SWE-Bench Pro (coding) | 56.26 percent |
| Terminal-Bench 2.1 (coding) | 59.55 percent |
| SimpleVQA with Search (search agents) | 79.16 percent |
| ClawEval-1.1 (general agents) | 67.07 percent |
The SimpleVQA with Search score puts Step 3.7 Flash effectively at parity with GPT-5.5 on that benchmark, and the lab claims its Advisor coding mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost. The ClawEval-1.1 result outperforms DeepSeek V4 Flash on the same general-agents benchmark.
⚠️Warning
Open-weight benchmark caveat. Vendor-reported numbers reflect the configuration the vendor ran. Hosted inference on OpenRouter or NVIDIA NIM may use different quantization, runtime, or prompt-formatting choices than the StepFun Open Platform reference deployment, and scores can shift several points in either direction. Validate against your actual deployment configuration — same runtime, same quantization, same context length, same reasoning mode — before committing to a production pattern.
Vision-Language Capabilities
The 1.8 billion parameter Vision Transformer is native to the architecture, not a separate adapter — image inputs flow through the same token stream as text. The headline application is the built-in Visual Search tool, which lets the model invoke a search call against an image (chart, diagram, screenshot, document page) and continue reasoning over the retrieved context. The design partially compensates for the limited parametric knowledge that comes with the moderate active-parameter budget: rather than memorizing facts, the model is tuned to retrieve them via tool calls.
Pricing
- Download Step 3.7 Flash from Hugging Face
- No use restrictions
- Self-host, fine-tune, redistribute derivatives
- Available on OpenRouter, NVIDIA NIM, StepFun Open Platform
- Pay per million input plus output tokens
- No infrastructure to manage
- Custom deployment support
- Volume pricing and SLAs
- Dedicated capacity
For most evaluators the right starting point is the StepFun Open Platform or OpenRouter hosted endpoint — fastest path from sign-up to working API calls. For data-residency-sensitive deployments or sustained heavy use, the open-weight tier on Hugging Face removes any per-token cost and lets you self-host on your chosen accelerator stack (BF16 for highest fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp deployments).
Strengths
- Apache 2.0 open weights: No use restrictions, no acceptable-use addenda — substantially more permissive than the Llama community license or Gemma's Google AUP
- Mixture-of-experts efficiency: 11 billion active parameters per token deliver knowledge capacity of a 198 billion total budget at inference cost closer to a smaller dense model
- Native vision-language from the first token: Visual Search tool plus on-graph image reasoning, no separate adapter pipeline required
- Three reasoning modes in one checkpoint: Quick lookups, intermediate planning, and deep tool-use workflows without model swaps
- Day-one hosted inference partners: OpenRouter, NVIDIA NIM, and the StepFun Open Platform available at launch; DeepInfra, Fireworks AI, and Modal queued for fast follow
- Agentic-coding benchmark parity at fractional cost: Advisor Mode claims roughly 97 percent of Claude Opus 4.6 coding performance at approximately one-ninth the per-task cost
- Multiple checkpoint formats: BF16, FP8, NVFP4, and GGUF support both research-grade and cost-optimized deployments out of the box
Limitations & Considerations
- Not a frontier-capability model: Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Pro maintain meaningful leads at the absolute capability ceiling; Step 3.7 Flash is positioned as the agentic-deployment alternative, not the flagship competitor
- Newer lab with smaller community: StepFun is well-established in the Chinese open-weights tier but has fewer English-language tutorials, integration guides, and third-party tooling than Llama or Mistral derivatives
- Limited parametric knowledge for the 11 billion active budget: The model is tuned to retrieve facts via tool calls rather than memorize them; deployments without robust tool wiring may feel less knowledgeable than the parameter count suggests
- Hosted inference quality varies by provider: Hosted endpoints on OpenRouter, NVIDIA NIM, and partner platforms may use different quantization choices — benchmark scores and behavior can shift between providers
- China-based lab with US regulatory considerations: Customers deploying Step 3.7 Flash in regulated US contexts (government, defense, certain financial verticals) should verify their compliance posture with respect to Chinese-origin open-weights models before production deployment
Best Use Cases
| Task | Why Step 3.7 Flash |
|---|---|
| Cost-sensitive agentic coding workflows | Advisor Mode at roughly one-ninth the per-task cost of Claude Opus 4.6 with comparable coding scores |
| Search agents with image inputs | Native Visual Search tool plus 79.16 percent SimpleVQA score with search |
| Long-context tool-use loops | 256,000-token context window with three reasoning modes covers planning, retrieval, and synthesis turns in one model |
| Open-weight deployment without license friction | Apache 2.0 with no use restrictions — self-host, fine-tune, redistribute without negotiating with StepFun |
| Multi-provider inference strategy | Same Apache 2.0 checkpoint across OpenRouter, NVIDIA NIM, DeepInfra, Fireworks AI, and Modal for failover and cost optimization |
When to choose alternatives:
- Absolute frontier capability ceiling for hosted use → Claude Opus 4.8, GPT-5.5, Gemini 3.5 Pro
- Largest open-weight community and tooling → Llama 4 derivatives or Mistral Large 3
- On-device deployment with minimal memory footprint → Liquid LFM2.5-8B-A1B
- Strongest Chinese open-weights chat experience → Kimi K2.6 or DeepSeek V4
Getting Started
- Try the hosted endpoint first — sign in at platform.stepfun.com for a StepFun API key, or use OpenRouter if you already have an OpenRouter account
- Pick a reasoning mode — start with medium for general workflows; switch to low for high-volume routing or high for deep multi-step debugging
- Wire the Visual Search tool if your workflow involves image or document inputs — Step 3.7 Flash is tuned to invoke it rather than memorize knowledge
- For self-hosted deployment — download the open weights from the stepfun-ai Hugging Face organization; pick a checkpoint format (BF16 for fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp) and a serving runtime (vLLM, SGLang, llama.cpp, or NVIDIA NIM)
- Benchmark on your workload before committing — vendor numbers are reference-deployment scores; validate against your actual prompts, runtime, quantization, and reasoning mode before production rollout
Key Takeaways
- Step 3.7 Flash is the May 2026 flagship open-weights release from Shanghai-based StepFun — a 198 billion total parameter mixture-of-experts vision-language model with roughly 11 billion active parameters per token
- The model exposes a 256,000-token context window, three selectable reasoning modes, and reported throughput up to 400 tokens per second under hosted inference
- Apache 2.0 weights with no use restrictions plus day-one hosting on the StepFun Open Platform, OpenRouter, and NVIDIA NIM close the integration gap that previously slowed open-weight adoption
- Benchmark wins focus on agentic coding (56.26 percent on SWE-Bench Pro) and search workflows (79.16 percent on SimpleVQA with Search) — Advisor Mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost
- Best suited for cost-sensitive agentic coding, search agents with image inputs, long-context tool-use loops, and any open-weight deployment that needs Apache 2.0 permissiveness
- Sits alongside Liquid LFM2.5, Kimi K2.6, and DeepSeek V4 as one of the labs building explicitly for practical agentic deployment rather than frontier capability ceiling