Name: Step 3.7 Flash
Availability: InStock
Author: StepFun

Learning Objectives

Understand how Step 3.7 Flash fits in the open-weights tier alongside Liquid LFM, Kimi K2.6, and DeepSeek
Identify the mixture-of-experts architecture choice and what it means for inference cost and throughput
Evaluate when Step 3.7 Flash is the right choice for agentic coding, search, and vision-language workflows

What Is Step 3.7 Flash?

Step 3.7 Flash is the headline open-weights release from StepFun (also written Stepfun), a Shanghai-based AI lab founded in 2023. It is a mixture-of-experts vision-language model engineered for agentic deployment — coding agents, search workflows, and multi-step tool use — rather than chasing raw frontier capability against the largest US labs. Released in late May 2026 under the Apache 2.0 license, the model ships with day-one hosted inference on the StepFun Open Platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal next in line.

The naming convention matters: Flash in StepFun's product line denotes the high-efficiency variant tuned for throughput and cost rather than the absolute capability ceiling. Step 3.7 Flash succeeds Step 3.5 Flash and sits in the same competitive band as Kimi K2.6, DeepSeek V4 Flash, and Liquid LFM2.5 — Chinese-or-MIT open-weights labs competing on the practical agentic-deployment dimension rather than the closed-source frontier.

✅Tip

Try Step 3.7 Flash: platform.stepfun.com — StepFun Open Platform with hosted inference; also available on OpenRouter and NVIDIA NIM. Open weights downloadable from the stepfun-ai Hugging Face organization.

Architecture and Specifications

The current flagship is Step 3.7 Flash, released open-weight in late May 2026. The architecture pairs a 196 billion parameter language backbone with a 1.8 billion parameter Vision Transformer (ViT), making the model natively multimodal from the first token.

Specification	Step 3.7 Flash
Total parameters	198 billion (196 billion language plus 1.8 billion vision)
Active parameters per token	Roughly 11 billion via mixture-of-experts routing
Context window	256,000 tokens
Throughput	Up to 400 tokens per second on hosted inference
License	Apache 2.0 — no use restrictions
Checkpoint formats	BF16, FP8, NVFP4, and GGUF
Reasoning modes	Three selectable levels (low, medium, high)

The 198 billion total parameter budget with roughly 11 billion active per token puts Step 3.7 Flash in the same architectural class as DeepSeek V4 and Kimi K2.6 — a sparse mixture-of-experts design that delivers the knowledge capacity of a much larger dense model at the inference cost of a smaller one. Three selectable reasoning levels let the same checkpoint cover quick lookups, intermediate planning, and deep multi-step tool-use workflows without swapping models.

💡Key Concept

Why mixture-of-experts for agentic workloads. An agentic loop alternates between cheap routing decisions (which tool to call next, which sub-goal to pursue) and expensive reasoning calls (compose the response, debug the failure). A sparse mixture-of-experts model can spend the cheap-call budget at low active-parameter cost and reserve the expensive-call budget for the few moments where it matters. The architecture is what makes Step 3.7 Flash's Advisor Mode pricing competitive with much smaller dense alternatives without sacrificing the capability ceiling on the hard turns.

Benchmark Performance

StepFun reports the following benchmark wins for Step 3.7 Flash:

Benchmark	Step 3.7 Flash
SWE-Bench Pro (coding)	56.26 percent
Terminal-Bench 2.1 (coding)	59.55 percent
SimpleVQA with Search (search agents)	79.16 percent
ClawEval-1.1 (general agents)	67.07 percent

The SimpleVQA with Search score puts Step 3.7 Flash effectively at parity with GPT-5.5 on that benchmark, and the lab claims its Advisor coding mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost. The ClawEval-1.1 result outperforms DeepSeek V4 Flash on the same general-agents benchmark.

⚠️Warning

Open-weight benchmark caveat. Vendor-reported numbers reflect the configuration the vendor ran. Hosted inference on OpenRouter or NVIDIA NIM may use different quantization, runtime, or prompt-formatting choices than the StepFun Open Platform reference deployment, and scores can shift several points in either direction. Validate against your actual deployment configuration — same runtime, same quantization, same context length, same reasoning mode — before committing to a production pattern.

Vision-Language Capabilities

The 1.8 billion parameter Vision Transformer is native to the architecture, not a separate adapter — image inputs flow through the same token stream as text. The headline application is the built-in Visual Search tool, which lets the model invoke a search call against an image (chart, diagram, screenshot, document page) and continue reasoning over the retrieved context. The design partially compensates for the limited parametric knowledge that comes with the moderate active-parameter budget: rather than memorizing facts, the model is tuned to retrieve them via tool calls.

Pricing

Plan	Price	Features
Open Weights	Free	Download Step 3.7 Flash from Hugging Face No use restrictions Self-host, fine-tune, redistribute derivatives
Hosted inference	Per-token pricing varies by host	Available on OpenRouter, NVIDIA NIM, StepFun Open Platform Pay per million input plus output tokens No infrastructure to manage
Enterprise	Contact StepFun	Custom deployment support Volume pricing and SLAs Dedicated capacity

Open WeightsFree

Download Step 3.7 Flash from Hugging Face
No use restrictions
Self-host, fine-tune, redistribute derivatives

Hosted inferencePer-token pricing varies by host

Available on OpenRouter, NVIDIA NIM, StepFun Open Platform
Pay per million input plus output tokens
No infrastructure to manage

EnterpriseContact StepFun

Custom deployment support
Volume pricing and SLAs
Dedicated capacity

For most evaluators the right starting point is the StepFun Open Platform or OpenRouter hosted endpoint — fastest path from sign-up to working API calls. For data-residency-sensitive deployments or sustained heavy use, the open-weight tier on Hugging Face removes any per-token cost and lets you self-host on your chosen accelerator stack (BF16 for highest fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp deployments).

Strengths

Apache 2.0 open weights: No use restrictions, no acceptable-use addenda — substantially more permissive than the Llama community license or Gemma's Google AUP
Mixture-of-experts efficiency: 11 billion active parameters per token deliver knowledge capacity of a 198 billion total budget at inference cost closer to a smaller dense model
Native vision-language from the first token: Visual Search tool plus on-graph image reasoning, no separate adapter pipeline required
Three reasoning modes in one checkpoint: Quick lookups, intermediate planning, and deep tool-use workflows without model swaps
Day-one hosted inference partners: OpenRouter, NVIDIA NIM, and the StepFun Open Platform available at launch; DeepInfra, Fireworks AI, and Modal queued for fast follow
Agentic-coding benchmark parity at fractional cost: Advisor Mode claims roughly 97 percent of Claude Opus 4.6 coding performance at approximately one-ninth the per-task cost
Multiple checkpoint formats: BF16, FP8, NVFP4, and GGUF support both research-grade and cost-optimized deployments out of the box

Limitations & Considerations

Not a frontier-capability model: Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Pro maintain meaningful leads at the absolute capability ceiling; Step 3.7 Flash is positioned as the agentic-deployment alternative, not the flagship competitor
Newer lab with smaller community: StepFun is well-established in the Chinese open-weights tier but has fewer English-language tutorials, integration guides, and third-party tooling than Llama or Mistral derivatives
Limited parametric knowledge for the 11 billion active budget: The model is tuned to retrieve facts via tool calls rather than memorize them; deployments without robust tool wiring may feel less knowledgeable than the parameter count suggests
Hosted inference quality varies by provider: Hosted endpoints on OpenRouter, NVIDIA NIM, and partner platforms may use different quantization choices — benchmark scores and behavior can shift between providers
China-based lab with US regulatory considerations: Customers deploying Step 3.7 Flash in regulated US contexts (government, defense, certain financial verticals) should verify their compliance posture with respect to Chinese-origin open-weights models before production deployment

Best Use Cases

Task	Why Step 3.7 Flash
Cost-sensitive agentic coding workflows	Advisor Mode at roughly one-ninth the per-task cost of Claude Opus 4.6 with comparable coding scores
Search agents with image inputs	Native Visual Search tool plus 79.16 percent SimpleVQA score with search
Long-context tool-use loops	256,000-token context window with three reasoning modes covers planning, retrieval, and synthesis turns in one model
Open-weight deployment without license friction	Apache 2.0 with no use restrictions — self-host, fine-tune, redistribute without negotiating with StepFun
Multi-provider inference strategy	Same Apache 2.0 checkpoint across OpenRouter, NVIDIA NIM, DeepInfra, Fireworks AI, and Modal for failover and cost optimization

When to choose alternatives:

Absolute frontier capability ceiling for hosted use → Claude Opus 4.8, GPT-5.5, Gemini 3.5 Pro
Largest open-weight community and tooling → Llama 4 derivatives or Mistral Large 3
On-device deployment with minimal memory footprint → Liquid LFM2.5-8B-A1B
Strongest Chinese open-weights chat experience → Kimi K2.6 or DeepSeek V4

Getting Started

Try the hosted endpoint first — sign in at platform.stepfun.com for a StepFun API key, or use OpenRouter if you already have an OpenRouter account
Pick a reasoning mode — start with medium for general workflows; switch to low for high-volume routing or high for deep multi-step debugging
Wire the Visual Search tool if your workflow involves image or document inputs — Step 3.7 Flash is tuned to invoke it rather than memorize knowledge
For self-hosted deployment — download the open weights from the stepfun-ai Hugging Face organization; pick a checkpoint format (BF16 for fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp) and a serving runtime (vLLM, SGLang, llama.cpp, or NVIDIA NIM)
Benchmark on your workload before committing — vendor numbers are reference-deployment scores; validate against your actual prompts, runtime, quantization, and reasoning mode before production rollout

Key Takeaways

Step 3.7 Flash is the May 2026 flagship open-weights release from Shanghai-based StepFun — a 198 billion total parameter mixture-of-experts vision-language model with roughly 11 billion active parameters per token
The model exposes a 256,000-token context window, three selectable reasoning modes, and reported throughput up to 400 tokens per second under hosted inference
Apache 2.0 weights with no use restrictions plus day-one hosting on the StepFun Open Platform, OpenRouter, and NVIDIA NIM close the integration gap that previously slowed open-weight adoption
Benchmark wins focus on agentic coding (56.26 percent on SWE-Bench Pro) and search workflows (79.16 percent on SimpleVQA with Search) — Advisor Mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost
Best suited for cost-sensitive agentic coding, search agents with image inputs, long-context tool-use loops, and any open-weight deployment that needs Apache 2.0 permissiveness
Sits alongside Liquid LFM2.5, Kimi K2.6, and DeepSeek V4 as one of the labs building explicitly for practical agentic deployment rather than frontier capability ceiling

Step 3.7 Flash (StepFun)

Audio & video lessons are paid features