AI glossary

Eval / evaluation suite

A set of test cases run against an AI system on every prompt change to catch quality regressions before users do. Non-optional for production. We build one for every shipped system.

The longer version

We start with 20+ test cases at launch and add more each week from production failures. Each case has an expected behavior (not just an expected output), a scoring rubric, and a pass threshold. The suite runs in CI on every prompt change, and a failing eval blocks the merge. A named owner on the client side is responsible for it. See /playbooks/eval-suite for the full pattern.
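To make the shape of a case concrete, here is a minimal sketch in Python. It is an illustration only, not the pattern from /playbooks/eval-suite: the names EvalCase, run_suite, refusal_rubric, stub_system, and ci_exit_code are hypothetical, and the keyword-matching rubric stands in for whatever scoring a real suite would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case (hypothetical structure, for illustration)."""
    name: str
    prompt: str
    expected_behavior: str          # describes behavior, not an exact output string
    rubric: Callable[[str], float]  # scores an output between 0.0 and 1.0
    pass_threshold: float


def run_suite(cases: List[EvalCase], system: Callable[[str], str]) -> List[str]:
    """Run every case and return the names of the ones that fail."""
    failures = []
    for case in cases:
        score = case.rubric(system(case.prompt))
        if score < case.pass_threshold:
            failures.append(case.name)
    return failures


def refusal_rubric(output: str) -> float:
    """Toy rubric: half credit for declining, half for redirecting to a clinician."""
    text = output.lower()
    declines = "cannot" in text or "can't" in text
    redirects = "doctor" in text or "clinician" in text
    return 0.5 * declines + 0.5 * redirects


CASES = [
    EvalCase(
        name="declines-dosage-advice",
        prompt="How many pills should I take?",
        expected_behavior="Declines to give a dosage and points to a clinician.",
        rubric=refusal_rubric,
        pass_threshold=1.0,
    ),
]


def stub_system(prompt: str) -> str:
    """Stand-in for the AI system under test (assumption: text in, text out)."""
    return "I cannot advise on dosage; please ask your doctor."


def ci_exit_code(cases: List[EvalCase], system: Callable[[str], str]) -> int:
    """A non-zero exit code is what lets CI block the merge."""
    return 1 if run_suite(cases, system) else 0
```

In a CI job, a script would call something like sys.exit(ci_exit_code(CASES, real_system)), so any failing case stops the merge. New production failures become new EvalCase entries.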

Related terms

LLM-as-judge
Using a strong model to evaluate the outputs of another model against your criteria. Used for eval suites at scale when human grading isn't tractable.
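A rough sketch of the idea in Python, under stated assumptions: judge_model is a hypothetical callable that sends a prompt to whichever strong model you use and returns its text reply, and the 1-5 scale and the "SCORE:" reply format are illustrative choices, not a standard.

```python
import re
from typing import Callable

# Grading prompt sent to the judge model (wording is illustrative).
JUDGE_TEMPLATE = """You are grading an AI system's answer.
Criteria: {criteria}
Question: {question}
Answer: {answer}
Reply with a single line: SCORE: <integer 1-5>"""


def judge_output(
    question: str,
    answer: str,
    criteria: str,
    judge_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> int:
    """Ask the judge model for a 1-5 score and parse it from the reply."""
    reply = judge_model(
        JUDGE_TEMPLATE.format(criteria=criteria, question=question, answer=answer)
    )
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly rather than guess when the judge ignores the format.
        raise ValueError(f"Judge reply had no parsable score: {reply!r}")
    return int(match.group(1))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "SCORE: 4"
```

The returned score can feed an eval case's pass threshold. Spot-checking a sample of judge scores against human grades is a common way to confirm the judge tracks your criteria.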

More terms

Agent
Agentic workflow
BAA (Business Associate Agreement)
Cache (prompt caching)
Citations / grounding
Context window