Name: Patronus AI
Author: Patronus AI

Learning Objectives

Understand the problem Patronus AI solves: how do you reliably test an AI model or agent before you trust it in production?
Identify Patronus's core products — Lynx, Glider, Percival, and Digital World Models — and what each one does
Evaluate where automated AI evaluation helps and where human judgment still has to stay in the loop

What Is Patronus AI?

Patronus AI is a San Francisco company that builds tools for testing AI — checking whether large language models and AI agents actually behave the way you need them to before you put them in front of customers. It was founded in 2023 by two former Meta AI researchers, Anand Kannappan and Rebecca Qian, around a gap that grew alongside generative AI: teams were shipping models far faster than they could reliably evaluate them.

The core idea is to turn "does this model behave?" from a manual spot-check into continuous, automated measurement. Patronus scores AI outputs for hallucinations, safety problems, and policy violations, and benchmarks them against custom criteria that match a specific company's domain — a bank cares about different failure modes than a hospital or a law firm.

💡Key Concept

Why AI evaluation is hard: A traditional software test has a right answer — the function returns 5 or it does not. An AI model's output is open-ended language, so "correct" is fuzzy: it might be fluent but subtly wrong, or right but unsafe. Evaluation tools like Patronus use a mix of specialized scoring models, rule checks, and benchmarks to grade open-ended outputs at scale, instead of relying on a human to read every response.
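
To make the contrast concrete, here is a minimal Python sketch. The `add` function, the `grade_answer` rubric, and the example strings are hypothetical illustrations and not part of any Patronus product. The rubric uses simple keyword rules where a real evaluator would rely on a scoring model.

```python
def add(a: int, b: int) -> int:
    return a + b

# Traditional test: one right answer, checked exactly.
assert add(2, 3) == 5

def grade_answer(answer: str, required_facts: list[str], banned_phrases: list[str]) -> dict:
    """Toy rubric (an assumption for illustration): grade on several axes, not one pass/fail."""
    text = answer.lower()
    found = [fact for fact in required_facts if fact.lower() in text]
    violations = [phrase for phrase in banned_phrases if phrase.lower() in text]
    return {
        "accuracy": len(found) / len(required_facts) if required_facts else 1.0,
        "safe": not violations,
        "violations": violations,
    }

# An answer can be factually right and still unsafe.
report = grade_answer(
    "Your refund takes 5 business days. You should definitely sue the bank.",
    required_facts=["5 business days"],
    banned_phrases=["you should definitely sue"],
)
print(report)
```

The answer scores full marks on accuracy while failing the safety check, which is the "right but unsafe" case a single pass/fail assertion cannot express.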

✅Tip

Visit Patronus AI: patronus.ai — an enterprise platform; the Glider and Lynx evaluator models are also available as open weights.

Pricing & Access

Patronus is primarily an enterprise platform sold to AI teams, with a free evaluation API tier for smaller projects and open-weight evaluator models anyone can run. Detailed platform pricing is quoted per customer.

Plan	Price	Features
Open models	Free	Glider and Lynx evaluator models as open weights; run them yourself; good for experimentation
Evaluation API	Free tier, then usage-based	Score outputs for hallucinations and safety; custom evaluation criteria; published benchmarks
Enterprise platform	Custom pricing	Full evaluation + monitoring suite; Percival agent evaluation; Digital World Models simulation

Core Capabilities

Automated Output Evaluation

Patronus scores model responses against criteria you define — factual accuracy, safety, tone, adherence to a policy — so a team can measure quality across thousands of responses instead of hand-checking a sample. This is the foundation the rest of the products build on.
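
As a rough sketch of what measuring quality across many responses can look like, the Python below applies named criteria to a batch and reports a pass rate per criterion. The helper names (`max_length`, `avoids`, `pass_rates`) and the sample responses are assumptions made for illustration. They are simple rule checks, not the Patronus API.

```python
from typing import Callable

# A criterion is any function that takes a response and returns pass/fail.
Criterion = Callable[[str], bool]

def max_length(limit: int) -> Criterion:
    return lambda response: len(response) <= limit

def avoids(phrase: str) -> Criterion:
    return lambda response: phrase.lower() not in response.lower()

def pass_rates(responses: list[str], criteria: dict[str, Criterion]) -> dict[str, float]:
    """Fraction of responses passing each named criterion."""
    if not responses:
        return {name: 0.0 for name in criteria}
    return {
        name: sum(check(r) for r in responses) / len(responses)
        for name, check in criteria.items()
    }

# Hypothetical criteria for a financial-advice assistant.
criteria = {
    "concise": max_length(80),
    "no_guarantees": avoids("guaranteed returns"),
}
responses = [
    "Index funds spread risk across many companies.",
    "This fund offers guaranteed returns of 12% a year.",
    "Past performance does not predict future results.",
    "Diversification lowers risk but cannot remove it.",
]
rates = pass_rates(responses, criteria)
print(rates)
```

Keeping each criterion as a small named function means the same batch can be re-scored whenever a policy changes, without touching the aggregation code.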

Lynx — Hallucination Detection

Lynx is a model purpose-built to catch hallucinations: cases where an AI states something that is not supported by its source material. It compares a response against the context it was given and flags claims that are not grounded, which is especially important for retrieval-augmented generation (RAG) systems where the whole point is to answer from trusted documents.
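
The idea of checking a response against its context can be sketched in a few lines of Python. Lynx itself is a trained model, so the word-overlap heuristic below is a deliberately crude stand-in and not how Lynx works. The function names, the 0.6 threshold, and the warranty example are all assumptions.

```python
import re

def content_words(text: str) -> set[str]:
    """Lowercased words longer than three characters, as a crude content filter."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def ungrounded_sentences(response: str, context: str, threshold: float = 0.6) -> list[str]:
    """Return response sentences whose content words are mostly absent from the context."""
    context_vocab = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & context_vocab) / len(words)
        if support < threshold:  # assumed cutoff, chosen only for this example
            flagged.append(sentence)
    return flagged

context = "The warranty covers manufacturing defects for 24 months from the purchase date."
response = (
    "The warranty covers manufacturing defects for 24 months. "
    "It also includes free accidental damage repairs."
)
flagged = ungrounded_sentences(response, context)
print(flagged)
```

The first sentence is fully supported by the context and passes. The second introduces a claim the context never makes, so it is flagged. A heuristic like this misses paraphrases and negations, which is the gap a purpose-built model is meant to close.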

Glider — A Compact Open Evaluator

Glider is a small open-weight model — only a few billion parameters — that acts as an automated judge, scoring text against user-defined criteria. Despite its size, Glider has outperformed much larger general-purpose models on evaluation tasks, which matters because running a giant model to grade every output would be slow and expensive. A small, sharp evaluator makes continuous testing affordable.
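
Scoring text against user-defined criteria usually means wrapping the text in a rubric, sending it to a judge, and parsing the reply defensively. The sketch below shows that shape only. It does not reproduce Glider's actual prompt format or output schema, and `build_judge_prompt`, `parse_score`, and `stub_judge` are hypothetical names, with the stub standing in for a real model call.

```python
import re
from typing import Callable, Optional

def build_judge_prompt(criterion: str, rubric: dict[int, str], text: str) -> str:
    """Assemble a plain-text grading prompt from a criterion and a score rubric."""
    levels = "\n".join(f"{score}: {meaning}" for score, meaning in sorted(rubric.items()))
    return (
        f"Criterion: {criterion}\n"
        f"Rubric:\n{levels}\n"
        f"Text to grade:\n{text}\n"
        "Reply with 'Score: <number>' followed by a one-sentence reason."
    )

def parse_score(judge_reply: str, rubric: dict[int, str]) -> Optional[int]:
    """Extract the score, rejecting replies that are malformed or off-rubric."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if score in rubric else None

def evaluate(text: str, criterion: str, rubric: dict[int, str],
             judge: Callable[[str], str]) -> Optional[int]:
    return parse_score(judge(build_judge_prompt(criterion, rubric, text)), rubric)

def stub_judge(prompt: str) -> str:
    # Stand-in for a real evaluator model; always returns the same reply.
    return "Score: 2 The reply is polite but skips the apology."

rubric = {1: "rude or dismissive", 2: "polite but incomplete", 3: "polite and complete"}
score = evaluate(
    "Thanks for reaching out. Your order ships Monday.",
    "Customer-service tone",
    rubric,
    stub_judge,
)
print(score)
```

Passing the judge in as a function keeps the rubric and parsing logic independent of whichever evaluator model ends up behind it.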

Percival — Agent Evaluation

Percival is an evaluation copilot for AI agents. Instead of judging a single answer, it inspects an agent's full execution trace — the sequence of steps, tool calls, and decisions — and flags failure modes like poor planning, tool misuse, and reasoning errors. It can also suggest fixes, which is valuable as agents take on multi-step tasks where a single bad step early on derails everything after it.
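
To illustrate what inspecting a full execution trace means, the sketch below represents a trace as a list of steps and scans it for three simple patterns: a tool outside the allowed set, a failed call, and an unchanged repeat of the previous call. This is an assumed, rule-based illustration and not Percival's method or API. The `Step` fields and tool names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str   # name of the tool the agent called
    args: str   # arguments, flattened to a string for this sketch
    ok: bool    # whether the call succeeded

def inspect_trace(trace: list[Step], allowed_tools: set[str]) -> list[str]:
    """Return human-readable findings for a few simple failure patterns."""
    findings = []
    for i, step in enumerate(trace):
        if step.tool not in allowed_tools:
            findings.append(f"step {i}: tool '{step.tool}' is not permitted")
        if not step.ok:
            findings.append(f"step {i}: call to '{step.tool}' failed")
        if i > 0 and (step.tool, step.args) == (trace[i - 1].tool, trace[i - 1].args):
            findings.append(f"step {i}: repeats the previous call unchanged")
    return findings

trace = [
    Step("search_orders", "customer=42", ok=True),
    Step("get_invoice", "order=7", ok=False),
    Step("get_invoice", "order=7", ok=False),
    Step("issue_refund", "order=7", ok=True),
]
findings = inspect_trace(trace, allowed_tools={"search_orders", "get_invoice"})
for finding in findings:
    print(finding)
```

Here the agent retries a failing call without changing anything, then reaches for a tool it was never granted. Neither problem is visible if only the final answer is graded.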

Digital World Models — Simulation Before Deployment

Introduced with the June 2026 funding round, Digital World Models are large simulated environments that reproduce realistic failure conditions, so an agent can be stress-tested against scenarios that resemble real-world workflows before it ever touches live systems, money, or customer data. The aim is to catch costly mistakes in a sandbox first, which makes it closer to a flight simulator for AI agents than a traditional test suite.
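
A toy version of the sandbox idea fits in a short Python sketch: a fake payment service injects a fault on a schedule, and a simple agent policy is run against it to see whether it causes damage. This says nothing about how Digital World Models are built. The `SandboxPayments` class, the `naive_agent` policy, and the lost-reply fault are assumptions chosen to show one costly mistake surfacing safely.

```python
class SandboxPayments:
    """Fake payment service: on scheduled calls the charge commits but the reply is lost."""

    def __init__(self, lost_replies: set[int]):
        self.lost_replies = lost_replies
        self.calls = 0
        self.charges: list[tuple[str, int]] = []

    def charge(self, order_id: str, cents: int) -> bool:
        self.calls += 1
        self.charges.append((order_id, cents))  # the charge is recorded first
        if self.calls in self.lost_replies:
            raise TimeoutError("simulated lost reply after commit")
        return True

def naive_agent(payments: SandboxPayments, order_id: str, cents: int,
                max_attempts: int = 3) -> bool:
    """Hypothetical agent policy: blindly retry the charge until a reply arrives."""
    for _ in range(max_attempts):
        try:
            return payments.charge(order_id, cents)
        except TimeoutError:
            continue
    return False

def run_scenario(lost_replies: set[int]) -> dict:
    sandbox = SandboxPayments(lost_replies)
    succeeded = naive_agent(sandbox, "order-7", 1999)
    charges = len(sandbox.charges)
    return {"succeeded": succeeded, "charges": charges, "double_charged": charges > 1}

scenarios = {"healthy": set(), "lost_reply": {1}}
results = {name: run_scenario(faults) for name, faults in scenarios.items()}
print(results)
```

In the healthy scenario the customer is charged once. When the first reply is lost, the blind retry charges the customer twice, a mistake that costs nothing to discover in a sandbox and real money to discover in production.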

Domain Benchmarks

Patronus has published evaluation benchmarks the wider field uses, including FinanceBench for accuracy on financial questions and CopyrightCatcher for detecting when a model regurgitates copyrighted text.

Strengths

Purpose-built for AI testing: Evaluation is the whole product, not a side feature — the platform is designed around the messy reality of grading open-ended AI outputs
Strong founding pedigree: Founded by former Meta AI researchers, with frontier labs and major cloud providers among its customers
Open evaluator models: Glider and Lynx are available as open weights, so teams can inspect and run the judges themselves rather than trusting a black box
Agent-aware: Percival and Digital World Models target agent failure modes specifically, not just single-response quality — an increasingly important gap as agents take on real authority
Well-funded: A $50 million Series B in June 2026 brought total funding to roughly $70 million

Limitations & Considerations

The evaluator is itself a model: Automated judges can be wrong too. They can miss a subtle error or flag a correct answer. Evaluation reduces manual review; it does not eliminate the need for human oversight on high-stakes outputs
Setup effort: Getting real value means defining good, domain-specific criteria and benchmarks; a generic configuration gives generic signal
Crowded category: AI evaluation and observability is competitive — LangSmith, Weights & Biases, Datadog, and others overlap, so the right choice depends on what stack a team already runs
Enterprise-oriented: The deepest capabilities sit in the paid platform; the free tier and open models are a starting point, not the full product
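
Because the evaluator is itself a model, one common safeguard is to audit it against a small set of human-labelled examples. The sketch below computes precision, recall, and Cohen's kappa for a judge's flags versus human flags. The labels are made-up illustrative data, not results from any Patronus evaluator, and `judge_agreement` is a hypothetical helper.

```python
def judge_agreement(human: list[int], judge: list[int]) -> dict[str, float]:
    """Compare judge flags with human flags (1 = output flagged as bad, 0 = fine)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # both flagged
    fp = sum(1 for h, j in pairs if j and not h)    # judge flagged a good output
    fn = sum(1 for h, j in pairs if h and not j)    # judge missed a bad output
    observed = sum(1 for h, j in pairs if h == j) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)    # agreement expected by chance
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "kappa": (observed - expected) / (1 - expected) if expected < 1 else 1.0,
    }

# Illustrative labels for ten outputs (invented for this example).
human_labels = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
stats = judge_agreement(human_labels, judge_labels)
print(stats)
```

With these labels the judge agrees with humans on 8 of 10 outputs, but chance alone would produce 52% agreement, so kappa comes out near 0.58. That gap between raw agreement and chance-corrected agreement is one reason to keep a human-reviewed sample in the loop.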

Best Use Cases

Who	Why Patronus AI Matters	How They Engage
AI product teams	Catch hallucinations and unsafe outputs before users do	Wire evaluation into the deployment pipeline as a continuous check
Regulated industries (finance, healthcare, legal)	Domain accuracy and policy compliance are non-negotiable	Use custom benchmarks like FinanceBench plus tailored criteria
Teams building AI agents	A single bad step can derail a multi-step task	Use Percival to inspect agent traces and Digital World Models to stress-test before launch
Frontier labs and model makers	Need rigorous, repeatable evaluation at scale	Run open evaluators and large simulation environments against new models
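
Wiring evaluation into a deployment pipeline as a continuous check usually comes down to a gate: compare the latest scores with agreed minimums and block the release if any fall short. The sketch below assumes the scores have already been computed elsewhere. The `gate` function, the metric names, and the thresholds are hypothetical.

```python
def gate(pass_rates: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return the reasons a release should be blocked; an empty list means proceed."""
    failures = []
    for name, minimum in minimums.items():
        rate = pass_rates.get(name)
        if rate is None:
            # A missing score is treated as a failure rather than silently skipped.
            failures.append(f"{name}: no score reported")
        elif rate < minimum:
            failures.append(f"{name}: {rate:.2%} is below the {minimum:.2%} minimum")
    return failures

failures = gate(
    pass_rates={"grounded": 0.97, "policy_compliant": 0.91},
    minimums={"grounded": 0.95, "policy_compliant": 0.99, "tone": 0.90},
)
exit_code = 1 if failures else 0  # a CI job would exit with this code
for reason in failures:
    print(reason)
```

A regulated team might set the policy threshold far higher than the tone threshold, which is why the minimums are kept per metric rather than as one global number.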

When to choose alternatives:

If you mainly need experiment tracking and model-training dashboards, Weights & Biases is a closer fit
If your evaluation needs are tightly tied to a specific framework's traces, that framework's native tooling (for example LangSmith for LangChain apps) may integrate more cleanly
For broad production monitoring across a whole application — not just the AI layer — a general observability suite like Datadog may cover more ground

Key Takeaways

Patronus AI is an evaluation, security, and simulation platform that automatically tests large language models and AI agents before they reach production
Its products span the stack: Lynx detects hallucinations, Glider is a compact open evaluator model, Percival evaluates AI agents' execution traces, and Digital World Models simulate realistic conditions for stress-testing
Founded in 2023 by former Meta AI researchers, it raised a $50 million Series B in June 2026 and counts leading AI labs and cloud providers among its customers
Automated evaluation makes testing affordable at scale, but the evaluator is itself a model — human oversight stays essential for high-stakes outputs
As companies hand AI agents real authority over systems and data, the ability to catch failures in a sandbox first is becoming a core part of the AI stack
