Learning Objectives
- Understand how Liquid AI's Liquid Foundation Models differ from frontier transformer-based LLMs in architecture and intended deployment
- Identify the headline 2026 release LFM2.5-8B-A1B and its parameter, training, and runtime characteristics
- Evaluate when a Liquid LFM is the right choice for on-device, edge, or data-residency-constrained AI workloads
What Is Liquid LFM?
Liquid LFM — the Liquid Foundation Model line — is the family of open-weight language models built by Liquid AI, a Cambridge, Massachusetts foundation-model lab spun out of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) in 2023. The four founders — Ramin Hasani (CEO), Mathias Lechner, Alexander Amini, and Daniela Rus, the CSAIL director — built the company around their academic work on liquid neural networks, a class of continuous-time models originally developed for robotic control and adapted into the LFM line.
Liquid's design thesis is that the next decade of generative AI will run primarily on-device — laptops, phones, automotive electronic control units, industrial controllers, embedded coding assistants — rather than in hyperscaler data centers, and that the dominant transformer architecture is poorly matched to that constraint. Every LFM design choice follows from that thesis: a sparse mixture-of-experts architecture that keeps active parameters small relative to total parameters, an open-weight license that lets customers fine-tune and deploy without per-token API fees, and day-one support for the runtimes that actually ship to end-user hardware.
✅Tip
Try Liquid LFM: liquid.ai — open weights on Hugging Face; LEAP customization platform for enterprise fine-tuning and deployment; Liquid Apollo consumer on-device app for hands-on exploration. No API key or hosted endpoint required to run LFM2.5-8B-A1B locally.
The Headline Release — LFM2.5-8B-A1B
The current flagship is LFM2.5-8B-A1B, released open-weight in late May 2026. The naming convention encodes the architecture: 8 billion total parameters (8B) with roughly 1 billion active per token (A1B) via the mixture-of-experts routing.
| Specification | LFM2.5-8B-A1B |
|---|---|
| Total parameters | 8 billion (MoE) |
| Active parameters per token | Roughly 1 billion |
| Pretraining tokens | 38 trillion |
| Context window | 128,000 tokens (post-trained from 32,000) |
| Vocabulary | 128,000 tokens (doubled from prior generation for non-Latin language efficiency) |
| License | Open-weight, no use restrictions |
| Runtime support | llama.cpp, MLX, vLLM, SGLang, ONNX (day one) |
The release is open-weight with no use restrictions — Liquid explicitly avoided the modified-MIT or Llama-style commercial caveats that other labs attach to their weights. Customers can download, fine-tune, embed in commercial products, and redistribute derivatives without negotiating with Liquid.
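The naming convention described above can be decoded mechanically. The following is a minimal sketch, assuming the pattern `<family>-<total>-A<active>` generalizes to other LFM names; the `parse_lfm_name` helper and `ModelSize` type are hypothetical and not part of any Liquid tooling.

```python
import re
from dataclasses import dataclass

# Hypothetical helper: decodes names shaped like "LFM2.5-8B-A1B".
_UNITS = {"M": 1e6, "B": 1e9}
_PATTERN = re.compile(
    r"(LFM[\d.]+)-(\d+(?:\.\d+)?)([MB])-A(\d+(?:\.\d+)?)([MB])"
)


@dataclass(frozen=True)
class ModelSize:
    family: str
    total_params: float   # all parameters stored in the checkpoint
    active_params: float  # parameters used per token after MoE routing


def parse_lfm_name(name: str) -> ModelSize:
    """Split a model name into family, total and active parameter counts."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    family, total, total_unit, active, active_unit = match.groups()
    return ModelSize(
        family=family,
        total_params=float(total) * _UNITS[total_unit],
        active_params=float(active) * _UNITS[active_unit],
    )


size = parse_lfm_name("LFM2.5-8B-A1B")
active_fraction = size.active_params / size.total_params
print(size.family, f"{active_fraction:.1%} of parameters active per token")
```

The active fraction it reports is nominal: the name rounds both counts, and the source describes the active figure as "roughly" 1 billion.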
💡Key Concept
Mixture-of-experts at the edge. An 8 billion total / 1 billion active MoE does roughly the per-token compute of a dense 1 billion-parameter model, because only the routed experts run for each token, so decode speed on a laptop or phone is closer to a dense 1B than a dense 8B. Memory is a different matter: every expert has to stay resident, so weight storage still scales with the full 8 billion parameters. The total knowledge capacity is likewise closer to the 8B. The architecture is what makes flagship-mini quality on consumer hardware plausible; the day-one runtime support is what makes it shippable.
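A back-of-envelope calculation separates the two quantities an MoE trades against each other. This sketch assumes weight storage scales with total parameters (all experts stay resident) and that a forward pass costs about 2 FLOPs per active parameter per token, a common approximation that ignores attention over the context, the KV cache, and quantization overhead. The function names are illustrative.

```python
TOTAL_PARAMS = 8e9   # from the model name (8B)
ACTIVE_PARAMS = 1e9  # "roughly 1 billion" active per token (A1B)


def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB; every expert must be resident."""
    return total_params * bits_per_weight / 8 / 1e9


def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per active parameter."""
    return 2.0 * active_params


mem_4bit = weight_memory_gb(TOTAL_PARAMS, 4)
mem_16bit = weight_memory_gb(TOTAL_PARAMS, 16)
compute_ratio = flops_per_token(TOTAL_PARAMS) / flops_per_token(ACTIVE_PARAMS)

print(f"weights at 4-bit:  ~{mem_4bit:.1f} GB")
print(f"weights at 16-bit: ~{mem_16bit:.1f} GB")
print(f"dense 8B costs ~{compute_ratio:.0f}x the per-token compute of the MoE")
```

Under these assumptions the model needs the memory budget of an 8B checkpoint but the per-token compute of a 1B one, which is why quantization choice matters as much as the routing design on phones and laptops.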
On-Device Performance
Liquid reports the following inference rates on consumer hardware — the headline number for an on-device foundation model is not aggregate FLOPs but tokens-per-second on devices that real users own:
| Device | Tokens per second |
|---|---|
| Apple M5 Max (laptop) | Roughly 253 |
| AMD Ryzen AI Max Plus (laptop) | Roughly 146 |
| Flagship smartphone (high-end Snapdragon / Apple A-series) | Roughly 30 |
For context: 30 tokens per second is well above human reading speed and is enough for conversational chat agents, dictation, real-time on-device tutoring, and most agent loops whose pace is set by the user reading the previous turn before the next one is needed. Laptop-class performance at 146 to 253 tokens per second is in the same band as cloud-hosted flagship-mini tiers like GPT-5.5-mini, Claude Haiku 4.5, and Gemini Flash.
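To make the reading-speed comparison concrete, the reported rates can be converted to words per minute. This is a rough sketch: the 0.75 words-per-token ratio and the 250 words-per-minute reading speed are assumptions (common rules of thumb for English), not figures from Liquid, and the ratio will differ with this model's tokenizer and with other languages.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough English average, tokenizer-dependent
READING_WPM = 250       # assumption: typical adult silent-reading speed

# Reported throughput figures from the table above (tokens per second).
REPORTED_TPS = {
    "Apple M5 Max": 253,
    "AMD Ryzen AI Max Plus": 146,
    "Flagship smartphone": 30,
}


def words_per_minute(tokens_per_second: float) -> float:
    """Convert a generation rate into an approximate words-per-minute figure."""
    return tokens_per_second * WORDS_PER_TOKEN * 60


for device, tps in REPORTED_TPS.items():
    wpm = words_per_minute(tps)
    print(f"{device}: ~{wpm:.0f} wpm, ~{wpm / READING_WPM:.1f}x reading speed")
```

Even the smartphone figure lands several times above the assumed reading speed, so for interactive use the bottleneck is more likely time-to-first-token than steady-state decode rate.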
Benchmark Performance
LFM2.5-8B-A1B's reported benchmark results, with the predecessor LFM2 baseline shown where one is given:
| Benchmark | LFM2.5-8B-A1B | LFM2 baseline |
|---|---|---|
| AA-Omniscience Index | -24.70 | -78.42 |
| Non-Hallucination Rate | 63.47% | 7.46% |
| IFEval (instruction following) | 91.84 | — |
| MATH500 | 88.76 | — |
| Tau-Squared Telecom (agentic) | 88.07 | — |
The non-hallucination rate jump, from 7.46% to 63.47%, is the single most consequential delta for on-device deployment: the baseline figure implies a hallucination rate above 90%, and rates that high sink any application that puts model output in front of an end user without a human-in-the-loop validator. The instruction-following and agentic-task scores (IFEval, Tau-Squared) put LFM2.5-8B-A1B in the competitive band for keyboard agents, voice assistants, and lightweight tool-use loops at the edge.
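The arithmetic behind that comparison is short. The sketch below assumes the hallucination rate is the simple complement of the reported non-hallucination rate, which is how the text treats it; the benchmark's exact denominator is not specified here, so read the outputs as implied rates on that benchmark only.

```python
def hallucination_rate(non_hallucination_pct: float) -> float:
    """Implied hallucination rate, assuming the two metrics sum to 100%."""
    return 100.0 - non_hallucination_pct


baseline = hallucination_rate(7.46)   # LFM2 baseline
current = hallucination_rate(63.47)   # LFM2.5-8B-A1B
improvement_points = baseline - current

print(f"LFM2 baseline implied hallucination rate:  {baseline:.2f}%")
print(f"LFM2.5-8B-A1B implied hallucination rate:  {current:.2f}%")
print(f"Reduction: {improvement_points:.2f} percentage points")
```

The implied rate falls from roughly 92.5% to roughly 36.5%, a large improvement that still leaves a meaningful share of answers to catch downstream.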
⚠️Warning
Open-weight benchmark caveat. Vendor-published benchmarks reflect the configuration the vendor ran. On-device deployments hit quantization, runtime, and memory-constraint trade-offs that can shift scores by several points in either direction. Validate against your actual deployment configuration — same runtime, same quantization, same context length — before committing to a production pattern.
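One lightweight way to enforce the "same runtime, same quantization, same context length" rule is to record both configurations and diff them before trusting a score. This is an illustrative sketch: the `RunConfig` type and the example values are hypothetical, not the configuration Liquid used for its published numbers.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    runtime: str
    quantization: str
    context_length: int


def config_mismatches(evaluated: RunConfig, deployed: RunConfig) -> list[str]:
    """Return the names of settings that differ between the two configs."""
    return [
        f.name
        for f in fields(RunConfig)
        if getattr(evaluated, f.name) != getattr(deployed, f.name)
    ]


# Hypothetical values for illustration only.
benchmark_run = RunConfig(runtime="vLLM", quantization="bf16", context_length=128_000)
device_build = RunConfig(runtime="llama.cpp", quantization="Q4_K_M", context_length=8_192)

mismatches = config_mismatches(benchmark_run, device_build)
if mismatches:
    print("Re-run the evaluation; settings differ:", ", ".join(mismatches))
```

An empty list means the score was measured on the configuration that will actually ship; anything else is a reason to re-run the evaluation on the target setup.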
How Liquid Compares
| Model | Total params | Active per token | Strongest fit |
|---|---|---|---|
| Liquid LFM2.5-8B-A1B | 8 billion | Roughly 1 billion | On-device inference; open-weight with no use restrictions |
| Google Gemma 3 (4B / 12B variants) | 4-12 billion | Same as total | On-device with Google-tooling integration |
| Microsoft Phi-4 | 14 billion | Same as total | On-device with strong reasoning per parameter |
| Apple on-device Foundation Models | 3 billion | Same as total | Apple-only; tight OS integration |
| Mistral Ministral (3B / 8B) | 3-8 billion | Same as total | EU-aligned edge deployment |
Liquid's differentiation is the MoE-at-the-edge design choice — smaller active-parameter count for the same memory footprint as competitors of similar total size, and the no-use-restrictions open-weight license that contrasts with Llama's Meta-imposed terms and Gemma's Google-aligned ones.
Beyond the LFM Line
Liquid sells two products built on top of the LFM checkpoints, mostly aimed at enterprise customers who need more than the open-weight download alone:
- Liquid LEAP — customization and deployment platform. Enterprise teams fine-tune LFM checkpoints against private data, validate, package the artifacts, and ship them to the customer's devices. Comparable to Hugging Face Inference Endpoints crossed with an on-device deployment toolchain.
- Liquid Apollo — consumer-facing on-device AI application. The headline use case is hands-on exploration: install Apollo, run an LFM locally, and validate that on-device inference at this parameter band actually feels usable before committing to a deployment.
Pricing
- Open weights: download LFM2.5-8B-A1B from Hugging Face; no use restrictions; self-host, fine-tune, redistribute derivatives
- Liquid LEAP: enterprise fine-tuning and deployment platform; custom on-device packaging; private-data fine-tuning support
- Liquid Apollo: on-device LFM runtime; hands-on exploration before deployment commitment
The open-weight tier is the centerpiece — Liquid's revenue model is the enterprise LEAP platform plus consulting and custom-model engagements, not per-token API fees on the base models. For most evaluators the right starting point is downloading LFM2.5-8B-A1B and testing it in your target runtime before any commercial conversation.
Strengths
- On-device first: Architecture, runtime support, and quantization choices all optimized for laptop and phone inference rather than data-center batching
- Open-weight with no use restrictions: Substantially more permissive than Llama (Meta acceptable use), Gemma (Google AUP), or even modified-MIT releases
- Strong non-hallucination rate for its class: 63.47% reported, up from 7.46% for the LFM2 baseline, which makes user-facing on-device output far more practical
- Day-one ecosystem support: llama.cpp, MLX, vLLM, SGLang, and ONNX from launch — no waiting for community ports
- MIT CSAIL pedigree: Founders include Daniela Rus (CSAIL director), grounding the lab in academic depth rather than pure product engineering
- Credible alternative to hyperscaler inference cost: Removes the largest line item for AI-product companies whose workloads don't require absolute frontier capability
Limitations & Considerations
- Not a frontier-capability model: Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Pro maintain large leads at the capability ceiling; LFM2.5-8B-A1B is the on-device deployment alternative, not the flagship competitor
- Smaller community than Llama or Mistral: Fewer English-language tutorials, integration guides, and third-party tooling than the largest open-weight communities — though day-one major-runtime support partially compensates
- No hosted API: If you want a managed cloud endpoint without self-hosting, you'll need to go through a third-party host that picks up the open weights — Liquid does not sell hosted inference
- Active-parameter MoE is harder to reason about than dense: Performance characteristics on bespoke runtimes (custom kernels, FPGA, NPUs) require empirical validation rather than scaling-law extrapolation from dense models
- Hardware variability matters: Reported tokens-per-second numbers come from flagship devices; older or budget hardware will land well below the headline figures
Best Use Cases
| Task | Why LFM2.5-8B-A1B |
|---|---|
| Keyboard agents and on-device autocomplete | Sub-second latency on phones; no per-token cloud fee |
| Voice assistants embedded in products | Local inference avoids round-trip latency to a cloud endpoint |
| Industrial-controller and embedded copilots | Open weights deployable on constrained hardware without phoning home |
| Data-residency-constrained workloads | Inference runs locally, so prompts and outputs never have to leave the device, which simplifies data-sovereignty audits |
| Cost-sensitive AI-product companies | Removes the largest cost line for applications below the frontier capability bar |
| Edge agents for in-car, in-factory, in-field deployment | Open-weight + day-one runtime support fits embedded-engineering workflows |
When to choose alternatives:
- Frontier capability ceiling for hosted use → Claude Opus 4.8, GPT-5.5, Gemini 3.5 Pro
- Strongest open-weight community and tooling → Mistral Large 3 or Llama 4 derivatives
- Tightest Apple-platform integration → Apple on-device Foundation Models
- Hosted inference with no self-hosting overhead → any flagship-mini hosted endpoint
Getting Started
- Download the weights from Hugging Face: search for `LiquidAI/LFM2.5-8B-A1B` (base) or the post-trained variant
- Pick a runtime: llama.cpp for CPU + GGUF deployments, MLX for Apple Silicon, vLLM or SGLang for batched server-side inference, ONNX for cross-platform mobile / embedded
- Validate on your target device with a representative prompt set: vendor-reported tokens-per-second numbers come from flagship devices, so verify your own hardware before committing
- For private-data fine-tuning without rolling your own pipeline, contact Liquid about the LEAP platform
- For hands-on consumer exploration, install Liquid Apollo and run the model on your phone before committing to an embedded deployment
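The validation step above can be sketched as a small throughput harness. It is runtime-agnostic by design: `generate` is a hypothetical callable you would wrap around whichever runtime you chose, and `fake_generate` below is only a stand-in so the sketch runs without a model.

```python
import time
from statistics import median
from typing import Callable, Iterable


def measure_tokens_per_second(
    generate: Callable[[str], int], prompts: Iterable[str]
) -> float:
    """Median tokens/second over prompts.

    `generate(prompt)` must run one completion and return the number of
    tokens it produced.
    """
    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if n_tokens > 0 and elapsed > 0:
            rates.append(n_tokens / elapsed)
    if not rates:
        raise ValueError("no successful generations to measure")
    return median(rates)


def fake_generate(prompt: str) -> int:
    # Stand-in for a real runtime call; replace with your own wrapper.
    time.sleep(0.01)
    return 5


tps = measure_tokens_per_second(fake_generate, ["hello", "summarize this note"])
print(f"median throughput: {tps:.1f} tokens/s")
```

Using the median rather than the mean keeps one slow outlier (a cold start, a thermal throttle) from skewing the figure. For a real comparison, run the same prompt set with the quantization and context length you plan to ship.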
Key Takeaways
- Liquid LFM is the family of open-weight Liquid Foundation Models from Cambridge-based MIT CSAIL spinoff Liquid AI, engineered for on-device inference rather than data-center batching
- The current flagship LFM2.5-8B-A1B is an 8 billion-parameter mixture-of-experts model with roughly 1 billion active parameters per token, pretrained on 38 trillion tokens, with a 128,000-token context window
- The model ships open-weight with no use restrictions, plus day-one runtime support across llama.cpp, MLX, vLLM, SGLang, and ONNX — closing the historical integration gap for open-weight on-device deployment
- Reported throughput is roughly 253 tokens per second on an Apple M5 Max, 146 tokens per second on a Ryzen AI Max Plus, and around 30 tokens per second on flagship smartphones — credible local inference in the same band as cloud-hosted flagship-mini tiers
- Best suited for keyboard agents, voice assistants, embedded copilots, data-residency-constrained workloads, and any AI-product application where hyperscaler inference cost is the binding constraint
- Liquid also sells the LEAP customization and deployment platform for enterprise teams and the Liquid Apollo consumer app for hands-on exploration; revenue model is enterprise-platform, not per-token API fees
- Sits alongside Apple on-device Foundation Models, Microsoft Phi, Google Gemma, and Mistral Ministral as one of the labs building explicitly for the on-device constraint — Liquid's differentiation is the MoE-at-the-edge design and the no-use-restrictions license