The eval suite you actually keep running
20 test cases, replayed weekly, evolving with production. The minimum for shipping.
An eval suite isn't optional. It's the system that tells you whether your production AI is still doing what it did on launch day. Without it, you ship a great demo and watch it silently degrade. This is the eval pattern we build into every engagement.
1. Start with 20 cases. Not 200.
20 hand-picked cases beat 200 synthetic ones. The 20 cover the happy path, common edge cases, and the patterns you can't afford to break.
We collect them from real conversations (anonymized), pre-launch interviews, and team brainstorming. Each case has expected behavior, not just expected output.
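To make "expected behavior, not just expected output" concrete, here is a minimal sketch of how one case could be stored. The field names, the category labels, and the phrase-matching check are all assumptions for illustration, not a fixed schema; a real suite would likely pair a check like this with a judge model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One hand-picked case. All field names here are hypothetical."""
    case_id: str
    category: str            # e.g. "happy_path", "edge_case", "must_not_break"
    source: str              # e.g. "conversation", "interview", "brainstorm"
    user_input: str
    expected_behavior: str   # plain-language description of what the agent should do
    must_include: Tuple[str, ...] = ()      # phrases a correct answer has to contain
    must_not_include: Tuple[str, ...] = ()  # phrases a correct answer must avoid


def check_behavior(case: EvalCase, output: str) -> List[str]:
    """Return a list of violations; an empty list means the case passed."""
    text = output.lower()
    problems = []
    for phrase in case.must_include:
        if phrase.lower() not in text:
            problems.append(f"missing: {phrase}")
    for phrase in case.must_not_include:
        if phrase.lower() in text:
            problems.append(f"forbidden: {phrase}")
    return problems


# Example case: the behavior matters (state the window, promise nothing),
# so many different wordings of the answer can pass.
REFUND_CASE = EvalCase(
    case_id="refund-001",
    category="happy_path",
    source="conversation",
    user_input="Can I get my money back?",
    expected_behavior="State the 30-day refund window without promising approval.",
    must_include=("30 days",),
    must_not_include=("guaranteed",),
)
```

Because the check targets behavior, a reworded but correct answer still passes, while an answer that over-promises fails even if it sounds fluent.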
2. Score what matters
We score four things: hallucination rate (LLM-as-judge against ground truth), refusal correctness (did the agent refuse when it should have?), tool-call accuracy, and citation correctness.
Aggregate scores are useful, but per-case scoring catches the failures that averages hide.
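A small sketch of why the per-case view matters. The case IDs, the scores, and the 0.7 floor are made-up illustrative values, not measurements from a real suite.

```python
from statistics import mean
from typing import Dict


def summarize(scores: Dict[str, float], floor: float = 0.7) -> dict:
    """Report the aggregate alongside every case that falls below the floor."""
    failing = sorted(case_id for case_id, s in scores.items() if s < floor)
    return {"mean": round(mean(scores.values()), 3), "failing_cases": failing}


# Hypothetical scores on a 0.0 to 1.0 scale.
scores = {
    "refund-001": 1.0,
    "refund-002": 1.0,
    "escalate-001": 0.95,
    "pii-001": 0.2,     # a serious failure
    "tools-001": 1.0,
}

report = summarize(scores)
print(report)
```

Here the mean comes out around 0.83, which looks healthy on a dashboard, while `failing_cases` still surfaces `pii-001`. Reporting both keeps the average from hiding the one case you can't afford to break.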
3. Run weekly, gate releases
The eval runs in CI on every prompt change, and a failing eval blocks the merge. Engineers learn fast that prompt changes carry regression risk.
Outside CI, we run the full suite weekly against production. Drift over time shows up here before it shows up in user complaints.
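One way to express the gate is a comparison of per-case scores against a stored baseline, sketched below. The function names, the 0.05 tolerance, and the score format are assumptions; how the scores get produced and where the baseline lives depend on your setup.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """List every case whose score dropped by more than `tolerance`."""
    regressions = []
    for case_id, old in sorted(baseline.items()):
        new = current.get(case_id)
        if new is None:
            # A case that silently disappears is treated as a regression too.
            regressions.append(f"{case_id}: missing from current run")
        elif new < old - tolerance:
            regressions.append(f"{case_id}: {old:.2f} -> {new:.2f}")
    return regressions


def gate(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    regressions = find_regressions(baseline, current)
    for line in regressions:
        print("REGRESSION", line)
    return 1 if regressions else 0
```

In a CI job the last step would be something like `sys.exit(gate(baseline, current))`, since a nonzero exit code is what most CI systems use to fail a check. The same comparison can serve the weekly production run: keep the launch-day scores as the baseline and drift shows up as regressions.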
4. Grow the suite from production
Every failure mode you catch in production becomes a permanent eval case. During active development, the suite grows by 2-5 cases per week.
Anonymize aggressively before adding production cases: strip PII such as names and account numbers.
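A first-pass scrubber might look like the sketch below. The regular expressions and placeholder tokens are illustrative assumptions and will miss things; pattern matching is a starting point, not complete PII detection, so a human should still read each case before it enters the suite.

```python
import re
from typing import Iterable

# Deliberately broad patterns: over-redacting is the safer mistake here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ACCOUNT = re.compile(r"\b\d{8,16}\b")            # assumed: accounts are 8-16 digit runs
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str, known_names: Iterable[str] = ()) -> str:
    """Replace emails, account numbers, phone numbers, and known names with tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = ACCOUNT.sub("[ACCOUNT]", text)
    text = PHONE.sub("[PHONE]", text)
    # Names can't be found reliably by pattern, so they are passed in,
    # for example from the customer record attached to the conversation.
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


raw = "Hi, I'm Dana Reyes (dana.reyes@example.com), account 1234567890, call +1 (555) 010-9999."
print(anonymize(raw, known_names=["Dana Reyes"]))
# prints: Hi, I'm [NAME] ([EMAIL]), account [ACCOUNT], call [PHONE].
```

Replacing values with typed tokens rather than deleting them keeps the case readable, so the eval still exercises the same behavior the production failure did.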
5. LLM-as-judge, then humans on the close calls
Have a strong model grade outputs against your criteria, and aim for it to score 80% or more of cases without human input.
The close calls and the failures go to humans. That's where your tuning judgment lives.
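The routing step can be sketched as a simple threshold split on the judge's score. The `Verdict` type, the 0.8 and 0.4 cut-offs, and the queue labels are assumptions for illustration; the judge call itself is left out because it depends on which model and API you use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Verdict:
    case_id: str
    score: float  # judge score from 0.0 (clear failure) to 1.0 (clear pass)


def route(verdicts: List[Verdict], pass_at: float = 0.8, fail_at: float = 0.4) -> Dict:
    """Auto-accept clear passes; queue close calls and failures for a human."""
    auto_pass = []
    human_queue = []
    for v in verdicts:
        if v.score >= pass_at:
            auto_pass.append(v.case_id)
        elif v.score >= fail_at:
            human_queue.append((v.case_id, "close_call"))
        else:
            human_queue.append((v.case_id, "failure"))
    auto_rate = len(auto_pass) / len(verdicts) if verdicts else 0.0
    return {"auto_pass": auto_pass, "human_queue": human_queue, "auto_rate": auto_rate}


# Hypothetical judge scores for five cases.
result = route([
    Verdict("a", 0.95),
    Verdict("b", 0.90),
    Verdict("c", 0.85),
    Verdict("d", 0.60),
    Verdict("e", 0.10),
])
print(result)
```

Tracking `auto_rate` over time gives a rough signal: if it drops well below the share you expect the judge to handle on its own, either the criteria are too vague for the judge or the agent is getting worse.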
Common failure mode
Treating evals as a one-time exercise. They're a permanent system, not a launch checklist.
Common skip
Not having an owner for the eval suite. It atrophies in weeks without one.