Playbook · 8 min · 6 pages

Production agent design

How to design agents that ship — guardrails, escalation, observability.

Most teams stall agents at the demo stage because they design for the happy path and figure out exceptions later. We design exception-first. This is the playbook we use on every agent engagement.

1. Define the action surface first

Before writing prompts, list every action the agent can take. Most teams skip this and inherit a sprawling tool list that nobody designed, which is hard to authorize and hard to test.

Build the list collaboratively with the human currently doing the workflow. They'll surface actions you'd miss — and reject ones you'd over-include.

Each action is a verb-object pair: 'create-ticket', 'lookup-account', 'send-email'
Each has an explicit input schema and a clear failure mode
Each is tagged as read / write / external — controls authorization later
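
To make the three properties above concrete, here is a minimal sketch of an action registry in Python. The action names come from the list above; the field names, the failure-mode wording, and the `validate` helper are illustrative assumptions, not a fixed format.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    READ = "read"
    WRITE = "write"
    EXTERNAL = "external"


@dataclass(frozen=True)
class Action:
    name: str        # verb-object pair, e.g. "create-ticket"
    kind: Kind       # read / write / external tag
    required: dict   # input schema: field name -> expected Python type
    on_failure: str  # the stated failure mode, in plain words


# Hypothetical action surface for a support workflow (schemas are assumptions).
ACTIONS = {
    a.name: a
    for a in [
        Action("lookup-account", Kind.READ, {"account_id": str},
               "report not-found, do not guess"),
        Action("create-ticket", Kind.WRITE, {"title": str, "body": str},
               "surface the error, do not retry blindly"),
        Action("send-email", Kind.EXTERNAL,
               {"to": str, "subject": str, "body": str},
               "leave as draft and escalate"),
    ]
}


def validate(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    action = ACTIONS.get(name)
    if action is None:
        return [f"unknown action: {name}"]
    problems = []
    for field_name, expected in action.required.items():
        if field_name not in args:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(args[field_name], expected):
            problems.append(f"wrong type for {field_name}")
    for field_name in args:
        if field_name not in action.required:
            problems.append(f"unexpected field: {field_name}")
    return problems
```

Anything the model proposes that is not in `ACTIONS`, or that fails `validate`, is rejected before it reaches a tool. The `kind` tag is what an authorization layer would key on.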

2. Wrap every write action in approval

v1 agents don't take write actions autonomously. They draft, request approval, then execute. Period.

Approval can be inline (chat confirmation), routed to a queue, or auto-approved against a policy — but it's explicit. The agent's success metric is the approve-rate, not the auto-execute count.
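
The draft, approve, execute loop can be sketched as a small gate. This is an illustration under assumptions: the `Draft` shape, the `policy_approver` rule, and treating every non-read action as needing approval are ours to show the idea, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    action: str
    args: dict
    kind: str  # "read", "write", or "external"


# An approver returns True (approve), False (reject), or None (no opinion).
Approver = Callable[[Draft], Optional[bool]]


def policy_approver(draft: Draft) -> Optional[bool]:
    # Hypothetical policy: low-priority tickets are auto-approved.
    if draft.action == "create-ticket" and draft.args.get("priority") == "low":
        return True
    return None


class ApprovalGate:
    def __init__(self, approvers: list[Approver],
                 execute: Callable[[Draft], object]):
        self.approvers = approvers  # tried in order: policy, chat, queue, ...
        self.execute = execute
        self.requested = 0
        self.approved = 0

    def run(self, draft: Draft):
        if draft.kind == "read":
            return self.execute(draft)  # reads skip approval
        self.requested += 1
        for approver in self.approvers:
            decision = approver(draft)
            if decision is None:
                continue                # this approver has no opinion
            if decision:
                self.approved += 1
                return self.execute(draft)
            return None                 # explicitly rejected
        return None                     # nobody approved, so nothing runs

    @property
    def approve_rate(self) -> float:
        return self.approved / self.requested if self.requested else 0.0
```

An inline chat confirmation or a review queue would be further approvers in the same list. The default when no approver answers is to do nothing, which keeps approval explicit.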

3. Build the escalation path before the happy path

Every agent has an 'I don't know, route to a human' branch. Building that first forces clarity on what the agent should refuse, what counts as ambiguous, and where humans queue up.

Confidence thresholds matter. We tune them after seeing the first 100 production conversations, not before.
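
A rough sketch of that escalation branch, assuming the agent produces an intent label and a confidence score per turn. The 0.8 threshold and the refused intents are placeholders for illustration only; the real values come from reviewing production conversations.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Placeholder threshold; tune it from real conversations, not up front.
    threshold: float = 0.8
    # Hypothetical intents the agent should always refuse.
    refused_intents: set = field(
        default_factory=lambda: {"legal-advice", "refund-over-limit"}
    )
    # Where escalated conversations wait for a human, with the reason attached.
    human_queue: list = field(default_factory=list)

    def route(self, conversation_id: str, intent: str, confidence: float) -> str:
        if intent in self.refused_intents:
            return self._escalate(conversation_id, "refused intent")
        if confidence < self.threshold:
            return self._escalate(conversation_id, "low confidence")
        return "agent"

    def _escalate(self, conversation_id: str, reason: str) -> str:
        self.human_queue.append((conversation_id, reason))
        return "human"
```

Recording the reason alongside each escalation makes the later threshold tuning easier, because the queue shows which rule fired for each conversation.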

4. Observability is non-optional

Every prompt, retrieval, tool call, and output gets logged with reasoning. We use Langfuse by default. You don't need fancy dashboards — you need replayable traces when something goes sideways.
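
To show what a replayable trace means in practice, here is a standard-library sketch that records each step as a JSON line. It is not the Langfuse API; the `Trace` class, the event fields, and `replay` are hypothetical stand-ins for whatever tracing tool is in use.

```python
import json
import time
import uuid
from typing import Optional


class Trace:
    """Collects the ordered events of one agent run."""

    def __init__(self, conversation_id: str):
        self.trace_id = uuid.uuid4().hex
        self.conversation_id = conversation_id
        self.events = []

    def log(self, kind: str, payload: dict,
            reasoning: Optional[str] = None) -> None:
        # kind is one of: "prompt", "retrieval", "tool_call", "output"
        self.events.append({
            "trace_id": self.trace_id,
            "conversation_id": self.conversation_id,
            "seq": len(self.events),  # ordering key used for replay
            "ts": time.time(),
            "kind": kind,
            "payload": payload,
            "reasoning": reasoning,
        })

    def dump(self) -> str:
        """Serialize the trace as JSON lines, one event per line."""
        return "\n".join(json.dumps(e) for e in self.events)


def replay(jsonl: str) -> list:
    """Load a dumped trace and return its events in original order."""
    events = [json.loads(line) for line in jsonl.splitlines() if line]
    return sorted(events, key=lambda e: e["seq"])
```

The point of the sketch is the shape of the data: every step carries a trace id, a sequence number, and the stated reasoning, so a single run can be reconstructed step by step after the fact.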

We hold a weekly review with the operations owner. We've never run an agent without one and we never plan to.

5. Eval before launch, eval after launch

20+ test cases in the eval suite from week one. Replay against every prompt change. We don't ship prompt edits without a green eval.

After launch, the eval suite grows weekly with production cases we never want to fail again.
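
The eval gate can be sketched in a few lines, assuming the agent can be called as a function from user input to a chosen action. `Case`, `run_suite`, and `can_ship` are hypothetical names; only the 20-case floor and the green-before-ship rule come from the text above.

```python
from dataclasses import dataclass
from typing import Callable

MIN_CASES = 20  # the "20+ test cases" floor from the playbook


@dataclass
class Case:
    name: str
    user_input: str
    expected_action: str


def run_suite(agent: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Return the names of failing cases; an empty list means green."""
    return [c.name for c in cases if agent(c.user_input) != c.expected_action]


def can_ship(agent: Callable[[str], str], cases: list[Case]) -> bool:
    """A prompt edit ships only if the suite is big enough and fully green."""
    return len(cases) >= MIN_CASES and not run_suite(agent, cases)


def toy_agent(text: str) -> str:
    # Stand-in for the real agent: picks an action from the input text.
    return "lookup-account" if "account" in text else "escalate"
```

Growing the suite after launch is then an append: each production failure becomes a new `Case`, and `can_ship` stays red until the agent handles it.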

Common failure mode

Teams build a generic 'AI assistant' that does too many things. Specialized > general for v1. One workflow, done well.

Common skip

Building agents without observability. You'll regret it the first time something breaks at 11pm on Saturday.