Free to read. Sign up to save your progress and take knowledge-check quizzes.

Sign up free
7 min read·Updated May 31, 2026

Step 3.7 Flash (StepFun)

StepFun logoBy StepFun

Step 3.7 Flash is the May 2026 flagship open-weights release from Shanghai-based StepFun — a 198 billion total parameter mixture-of-experts vision-language model with roughly 11 billion active parameters per token, a 256,000-token context window, up to 400 tokens per second of throughput, and Apache 2.0 weights. The model targets agentic coding and search workflows with reported parity against frontier models at a fraction of the per-task cost.

Listen to this lesson

Free preview · first 0:30
0:00 / 0:30

Audio & video lessons are paid features

Plus unlocks audio streaming. Pro adds downloadable audio, video, certificates, and more.

Plus adds:
  • Audio streaming
  • Downloadable PDFs
  • All AI Playbooks
  • Personalized content
Pro also adds:
  • Certificates of completion
  • Audio MP3 downloads
  • Video lessonssoon
  • & More…soon

Watch this lesson

Video coming soon

Learning Objectives

  • Understand how Step 3.7 Flash fits in the open-weights tier alongside Liquid LFM, Kimi K2.6, and DeepSeek
  • Identify the mixture-of-experts architecture choice and what it means for inference cost and throughput
  • Evaluate when Step 3.7 Flash is the right choice for agentic coding, search, and vision-language workflows

What Is Step 3.7 Flash?

Step 3.7 Flash is the headline open-weights release from StepFun (also written Stepfun), a Shanghai-based AI lab founded in 2023. It is a mixture-of-experts vision-language model engineered for agentic deployment — coding agents, search workflows, and multi-step tool use — rather than chasing raw frontier capability against the largest US labs. Released in late May 2026 under the Apache 2.0 license, the model ships with day-one hosted inference on the StepFun Open Platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal next in line.

The naming convention matters: Flash in StepFun's product line denotes the high-efficiency variant tuned for throughput and cost rather than the absolute capability ceiling. Step 3.7 Flash succeeds Step 3.5 Flash and sits in the same competitive band as Kimi K2.6, DeepSeek V4 Flash, and Liquid LFM2.5 — Chinese-or-MIT open-weights labs competing on the practical agentic-deployment dimension rather than the closed-source frontier.

Tip

Try Step 3.7 Flash: platform.stepfun.com — StepFun Open Platform with hosted inference; also available on OpenRouter and NVIDIA NIM. Open weights downloadable from the stepfun-ai Hugging Face organization.

Architecture and Specifications

The current flagship is Step 3.7 Flash, released open-weight in late May 2026. The architecture pairs a 196 billion parameter language backbone with a 1.8 billion parameter Vision Transformer (ViT), making the model natively multimodal from the first token.

SpecificationStep 3.7 Flash
Total parameters198 billion (196 billion language plus 1.8 billion vision)
Active parameters per tokenRoughly 11 billion via mixture-of-experts routing
Context window256,000 tokens
ThroughputUp to 400 tokens per second on hosted inference
LicenseApache 2.0 — no use restrictions
Checkpoint formatsBF16, FP8, NVFP4, and GGUF
Reasoning modesThree selectable levels (low, medium, high)

The 198 billion total parameter budget with roughly 11 billion active per token puts Step 3.7 Flash in the same architectural class as DeepSeek V4 and Kimi K2.6 — a sparse mixture-of-experts design that delivers the knowledge capacity of a much larger dense model at the inference cost of a smaller one. Three selectable reasoning levels let the same checkpoint cover quick lookups, intermediate planning, and deep multi-step tool-use workflows without swapping models.

💡Key Concept

Why mixture-of-experts for agentic workloads. An agentic loop alternates between cheap routing decisions (which tool to call next, which sub-goal to pursue) and expensive reasoning calls (compose the response, debug the failure). A sparse mixture-of-experts model can spend the cheap-call budget at low active-parameter cost and reserve the expensive-call budget for the few moments where it matters. The architecture is what makes Step 3.7 Flash's Advisor Mode pricing competitive with much smaller dense alternatives without sacrificing the capability ceiling on the hard turns.

Benchmark Performance

StepFun reports the following benchmark wins for Step 3.7 Flash:

BenchmarkStep 3.7 Flash
SWE-Bench Pro (coding)56.26 percent
Terminal-Bench 2.1 (coding)59.55 percent
SimpleVQA with Search (search agents)79.16 percent
ClawEval-1.1 (general agents)67.07 percent

The SimpleVQA with Search score puts Step 3.7 Flash effectively at parity with GPT-5.5 on that benchmark, and the lab claims its Advisor coding mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost. The ClawEval-1.1 result outperforms DeepSeek V4 Flash on the same general-agents benchmark.

⚠️Warning

Open-weight benchmark caveat. Vendor-reported numbers reflect the configuration the vendor ran. Hosted inference on OpenRouter or NVIDIA NIM may use different quantization, runtime, or prompt-formatting choices than the StepFun Open Platform reference deployment, and scores can shift several points in either direction. Validate against your actual deployment configuration — same runtime, same quantization, same context length, same reasoning mode — before committing to a production pattern.

Vision-Language Capabilities

The 1.8 billion parameter Vision Transformer is native to the architecture, not a separate adapter — image inputs flow through the same token stream as text. The headline application is the built-in Visual Search tool, which lets the model invoke a search call against an image (chart, diagram, screenshot, document page) and continue reasoning over the retrieved context. The design partially compensates for the limited parametric knowledge that comes with the moderate active-parameter budget: rather than memorizing facts, the model is tuned to retrieve them via tool calls.

Pricing

Open WeightsFree
  • Download Step 3.7 Flash from Hugging Face
  • No use restrictions
  • Self-host, fine-tune, redistribute derivatives
Hosted inferencePer-token pricing varies by host
  • Available on OpenRouter, NVIDIA NIM, StepFun Open Platform
  • Pay per million input plus output tokens
  • No infrastructure to manage
EnterpriseContact StepFun
  • Custom deployment support
  • Volume pricing and SLAs
  • Dedicated capacity

For most evaluators the right starting point is the StepFun Open Platform or OpenRouter hosted endpoint — fastest path from sign-up to working API calls. For data-residency-sensitive deployments or sustained heavy use, the open-weight tier on Hugging Face removes any per-token cost and lets you self-host on your chosen accelerator stack (BF16 for highest fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp deployments).

Strengths

  • Apache 2.0 open weights: No use restrictions, no acceptable-use addenda — substantially more permissive than the Llama community license or Gemma's Google AUP
  • Mixture-of-experts efficiency: 11 billion active parameters per token deliver knowledge capacity of a 198 billion total budget at inference cost closer to a smaller dense model
  • Native vision-language from the first token: Visual Search tool plus on-graph image reasoning, no separate adapter pipeline required
  • Three reasoning modes in one checkpoint: Quick lookups, intermediate planning, and deep tool-use workflows without model swaps
  • Day-one hosted inference partners: OpenRouter, NVIDIA NIM, and the StepFun Open Platform available at launch; DeepInfra, Fireworks AI, and Modal queued for fast follow
  • Agentic-coding benchmark parity at fractional cost: Advisor Mode claims roughly 97 percent of Claude Opus 4.6 coding performance at approximately one-ninth the per-task cost
  • Multiple checkpoint formats: BF16, FP8, NVFP4, and GGUF support both research-grade and cost-optimized deployments out of the box

Limitations & Considerations

  • Not a frontier-capability model: Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Pro maintain meaningful leads at the absolute capability ceiling; Step 3.7 Flash is positioned as the agentic-deployment alternative, not the flagship competitor
  • Newer lab with smaller community: StepFun is well-established in the Chinese open-weights tier but has fewer English-language tutorials, integration guides, and third-party tooling than Llama or Mistral derivatives
  • Limited parametric knowledge for the 11 billion active budget: The model is tuned to retrieve facts via tool calls rather than memorize them; deployments without robust tool wiring may feel less knowledgeable than the parameter count suggests
  • Hosted inference quality varies by provider: Hosted endpoints on OpenRouter, NVIDIA NIM, and partner platforms may use different quantization choices — benchmark scores and behavior can shift between providers
  • China-based lab with US regulatory considerations: Customers deploying Step 3.7 Flash in regulated US contexts (government, defense, certain financial verticals) should verify their compliance posture with respect to Chinese-origin open-weights models before production deployment

Best Use Cases

TaskWhy Step 3.7 Flash
Cost-sensitive agentic coding workflowsAdvisor Mode at roughly one-ninth the per-task cost of Claude Opus 4.6 with comparable coding scores
Search agents with image inputsNative Visual Search tool plus 79.16 percent SimpleVQA score with search
Long-context tool-use loops256,000-token context window with three reasoning modes covers planning, retrieval, and synthesis turns in one model
Open-weight deployment without license frictionApache 2.0 with no use restrictions — self-host, fine-tune, redistribute without negotiating with StepFun
Multi-provider inference strategySame Apache 2.0 checkpoint across OpenRouter, NVIDIA NIM, DeepInfra, Fireworks AI, and Modal for failover and cost optimization

When to choose alternatives:

  • Absolute frontier capability ceiling for hosted use → Claude Opus 4.8, GPT-5.5, Gemini 3.5 Pro
  • Largest open-weight community and tooling → Llama 4 derivatives or Mistral Large 3
  • On-device deployment with minimal memory footprint → Liquid LFM2.5-8B-A1B
  • Strongest Chinese open-weights chat experience → Kimi K2.6 or DeepSeek V4

Getting Started

  1. Try the hosted endpoint first — sign in at platform.stepfun.com for a StepFun API key, or use OpenRouter if you already have an OpenRouter account
  2. Pick a reasoning mode — start with medium for general workflows; switch to low for high-volume routing or high for deep multi-step debugging
  3. Wire the Visual Search tool if your workflow involves image or document inputs — Step 3.7 Flash is tuned to invoke it rather than memorize knowledge
  4. For self-hosted deployment — download the open weights from the stepfun-ai Hugging Face organization; pick a checkpoint format (BF16 for fidelity, FP8 or NVFP4 for cost, GGUF for llama.cpp) and a serving runtime (vLLM, SGLang, llama.cpp, or NVIDIA NIM)
  5. Benchmark on your workload before committing — vendor numbers are reference-deployment scores; validate against your actual prompts, runtime, quantization, and reasoning mode before production rollout

Key Takeaways

  • Step 3.7 Flash is the May 2026 flagship open-weights release from Shanghai-based StepFun — a 198 billion total parameter mixture-of-experts vision-language model with roughly 11 billion active parameters per token
  • The model exposes a 256,000-token context window, three selectable reasoning modes, and reported throughput up to 400 tokens per second under hosted inference
  • Apache 2.0 weights with no use restrictions plus day-one hosting on the StepFun Open Platform, OpenRouter, and NVIDIA NIM close the integration gap that previously slowed open-weight adoption
  • Benchmark wins focus on agentic coding (56.26 percent on SWE-Bench Pro) and search workflows (79.16 percent on SimpleVQA with Search) — Advisor Mode reaches roughly 97 percent of Claude Opus 4.6's coding performance at approximately one-ninth the per-task cost
  • Best suited for cost-sensitive agentic coding, search agents with image inputs, long-context tool-use loops, and any open-weight deployment that needs Apache 2.0 permissiveness
  • Sits alongside Liquid LFM2.5, Kimi K2.6, and DeepSeek V4 as one of the labs building explicitly for practical agentic deployment rather than frontier capability ceiling

Save your progress & take the quiz

Sign up free to bookmark lessons, track which modules you've completed, and lock in what you learned with a quick knowledge-check quiz at the end of each lesson.

🧭Recommended for you