
15 Patterns That Keep Production AI Agents From Burning Down Prod


Your agent demo works perfectly. It calls tools, reasons through problems, and delivers impressive results. Then you deploy it.

Suddenly it’s retrying a dead API 4,000 times. It’s spending $200 in tokens on a single user request. It’s mutating production data without guardrails. Three agents are fighting over the same ticket, each convinced the other two are wrong.

The gap between “agent that works” and “agent you can trust in production” isn’t about smarter models. It’s about the same boring infrastructure patterns that have kept distributed systems alive for decades, adapted for a world where your “microservice” can hallucinate.

Here are 15 patterns that close that gap. I’ve grouped them into four categories: resilience, containment, architecture, and operations.

The Resilience Stack

These four patterns work together to handle failures gracefully instead of cascading them.

1. Agent Circuit Breaker

A circuit breaker stops an agent from hammering a dependency that is clearly failing. Provider outage, persistent 5xx, tool bug. Without one, your agent will retry into oblivion, burning tokens and cascading the failure downstream.

The pattern has three states:

%%{init: {"layout": "dagre"}}%%
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : failure threshold crossed
    Open --> HalfOpen : cooldown expires
    HalfOpen --> Closed : probe succeeds
    HalfOpen --> Open : probe fails

In the closed state, requests flow normally. When error rates cross a threshold (say, 5 failures in 60 seconds), the circuit opens and all new requests fail fast with a controlled error. After a cooldown, the circuit moves to half-open and probes with limited traffic to test recovery.

You can also apply this concept to safety. Research on representation-level circuit breaking shows you can detect harmful internal activations and short-circuit a model before it emits unsafe content. For production agents, that means guardrail layers that terminate runs when certain patterns appear (policy violations, repeated jailbreak attempts, calls to dangerous tools) instead of relying purely on output filters.

Design checklist:

  • Track rolling failure metrics (error rate, timeouts, provider-specific 4xx/5xx) per backend or tool
  • Model the three states explicitly. Don’t just do “retry 3 times then give up”
  • Combine with retries and fallbacks: transient issues get retried, sustained issues trip the breaker
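
The three-state machine above can be sketched in a few dozen lines. This is a minimal illustration, not a production library; the class name, thresholds, and rolling-window bookkeeping are all illustrative:

```python
import time

class CircuitBreaker:
    """Three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, window_s=60, cooldown_s=30):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []          # timestamps of recent failures
        self.state = "closed"
        self.opened_at = None

    def allow_request(self):
        if self.state == "open":
            if time.time() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"   # cooldown over: let one probe through
                return True
            return False                   # fail fast with a controlled error
        return True                        # closed or half_open: traffic flows

    def record_success(self):
        if self.state == "half_open":      # probe succeeded: recover
            self.state = "closed"
            self.failures.clear()

    def record_failure(self):
        now = time.time()
        # Keep only failures inside the rolling window before counting.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if self.state == "half_open" or len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

A caller checks `allow_request()` before each tool or provider call and records the outcome after, so sustained failures trip the breaker while transient ones just get retried.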

2. Tool Invocation Timeout

Without timeouts, a slow or hung dependency stalls your entire agent run. The user sees a spinner. Your token meter keeps ticking. Nothing happens.

Set per-tool timeout budgets based on realistic latency data:

Tool Type              Typical p95    Timeout Budget
──────────────────────────────────────────────────────
Database query         50ms           150ms
Web search             800ms          2,000ms
Code execution         2,000ms        5,000ms
LLM sub-call           3,000ms        8,000ms
External API           500ms          1,500ms
──────────────────────────────────────────────────────
Overall wall clock     —              30,000ms

Set each tool’s timeout at 1.5-2x its typical p95. Add an overall “wall clock” limit per agent run so no single request can block indefinitely regardless of how many tools it chains.

The key insight: classify timeouts separately from logical errors. A timeout means “we don’t know what happened.” A 400 error means “the request was bad.” Different root causes, different remediation paths. When a tool crosses its timeout threshold repeatedly, couple it with the circuit breaker to stop trying altogether.
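
A sketch of per-tool budgets plus a wall-clock limit using `asyncio.wait_for`. The budget values mirror the table above; the tool names and the `ToolTimeout` exception are illustrative:

```python
import asyncio

# Illustrative per-tool budgets (seconds), roughly 1.5-2x typical p95.
TIMEOUTS = {"db_query": 0.15, "web_search": 2.0, "code_exec": 5.0}
WALL_CLOCK_S = 30.0  # overall limit per agent run

class ToolTimeout(Exception):
    """Timeout: outcome unknown -- distinct from a logical 4xx-style error."""

async def call_tool(name, coro):
    try:
        return await asyncio.wait_for(coro, timeout=TIMEOUTS[name])
    except asyncio.TimeoutError:
        # Classified separately so retry/breaker logic can treat it differently.
        raise ToolTimeout(f"{name} exceeded {TIMEOUTS[name]}s budget")

async def run_agent(steps):
    # The whole run gets a wall-clock budget regardless of how many tools chain.
    return await asyncio.wait_for(steps(), timeout=WALL_CLOCK_S)
```

The point of the separate exception type is exactly the distinction in the paragraph above: a `ToolTimeout` feeds the circuit breaker's failure count, while a logical error does not warrant a retry at all.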

3. Idempotent Tool Calls

Retries are only safe if your tools can handle being called twice with the same inputs. Without idempotency, a retry after a timeout might double-charge a credit card, send a duplicate email, or create two Jira tickets.

%%{init: {"layout": "dagre"}}%%
flowchart LR
    Agent -->|"op_id: abc-123"| Tool[Tool: Create Order]
    Tool --> Log["Operation Log\n(abc-123: completed)"]
    Agent -->|"retry op_id: abc-123"| Tool
    Tool --> Log
    Log -->|"already exists"| NoOp[Return cached result]

For read-only tools: they’re already idempotent. No changes needed.

For write tools: require a caller-supplied idempotency key. Store operation logs keyed by that ID so retries return the cached result instead of re-executing side effects.

For non-idempotent operations you can’t redesign (legacy APIs, third-party services): simulate idempotency with deduplication keys or “upsert” semantics at the integration layer.

4. Dead Letter Queue for Failed Runs

A dead letter queue (DLQ) holds agent runs that couldn’t complete after configured retry attempts. Instead of losing them or retrying forever, you park them for human triage.

%%{init: {"layout": "dagre"}}%%
flowchart LR
    Queue[Task Queue] --> Agent
    Agent -->|success| Done[Completed]
    Agent -->|retry 1| Agent
    Agent -->|retry 2| Agent
    Agent -->|retry 3 fails| DLQ[Dead Letter Queue]
    DLQ --> Human[Human Review]
    Human -->|fixed| Queue

Why this matters for agents specifically: agent failures are messier than typical service failures. A failed run might have partial state, tool outputs from earlier steps, and a decision history that matters for debugging. Attach all of it as metadata.

  • Define per-task max attempts before DLQ. Three is a good default. Don’t allow unbounded retries
  • Build dashboards and alerts on DLQ volume. Spikes are early canaries for regressions
  • Once the underlying bug is fixed, forward repaired messages back to the main queue for reprocessing
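
A minimal sketch of the retry-then-park flow, with the failure context attached as metadata. Queue and DLQ are plain in-memory structures here; in production they would be a message broker's queues:

```python
from collections import deque

MAX_ATTEMPTS = 3
queue, dlq = deque(), []

def process(task, handler):
    """Run a task; after MAX_ATTEMPTS failures, park it in the DLQ with context."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(task)
        except Exception as exc:
            last_error = exc
    # Bounded retries exhausted: attach debugging metadata and park for triage.
    dlq.append({"task": task, "attempts": MAX_ATTEMPTS, "error": str(last_error)})
    return None

def redrive():
    """After the underlying bug is fixed, forward entries back to the main queue."""
    while dlq:
        queue.append(dlq.pop()["task"])
```

In a real system the DLQ entry would also carry partial state and tool outputs from earlier steps, as noted above, so a human can reconstruct what the run was doing.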

The Containment Layer

These patterns limit the blast radius when an agent misbehaves.

5. Blast Radius Limiter

Even with circuit breakers and DLQs, you need hard caps on what an agent can do per request. Think of it as the agentic equivalent of IAM policies, rate limits, and spending quotas.

Resource          Per-Request Limit   Per-Session Limit   Per-Day Limit
────────────────────────────────────────────────────────────────────────
LLM tokens        8,000               50,000              500,000
Tool calls        10                  50                  500
DB mutations      5                   20                  100
Emails sent       1                   5                   20
Estimated cost    $0.50               $5.00               $50.00

Separate “read-only” and “write” environments. Reads get generous limits. Writes get strict limits and approval gates. When a limit is hit, alert and escalate to human review instead of silently dropping work.

Why this works: gateway-level observability makes this enforceable by tracking latency, token usage, and costs per route, user, or workflow. Limit breaches trigger automated shutdowns before they become incidents.
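
Enforcement can be as simple as a counter per resource per request. This sketch uses the illustrative per-request limits from the table; a real limiter would alert and escalate rather than just raise:

```python
class BlastRadiusLimiter:
    """Hard per-request caps; hitting one raises instead of silently dropping work."""

    LIMITS = {"llm_tokens": 8_000, "tool_calls": 10,
              "db_mutations": 5, "cost_usd": 0.50}  # per-request, illustrative

    def __init__(self):
        self.used = {k: 0 for k in self.LIMITS}

    def charge(self, resource, amount=1):
        self.used[resource] += amount
        if self.used[resource] > self.LIMITS[resource]:
            # In production: fire an alert and escalate to human review here.
            raise RuntimeError(f"blast-radius limit hit: {resource}")
```

The agent loop calls `charge()` before each tool call or LLM request, so a runaway loop hits a hard ceiling instead of your provider bill.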

6. Confidence Threshold Gate

A confidence gate blocks risky actions when the model is uncertain and routes them to safer alternatives: ask a clarifying question, use a simpler flow, or escalate to a human.

%%{init: {"layout": "dagre"}}%%
flowchart TB
    Agent[Agent Decision] --> Score{Confidence\nScore}
    Score -->|"> 0.85"| Execute[Auto-execute]
    Score -->|"0.60 - 0.85"| Confirm[Ask User\nto Confirm]
    Score -->|"< 0.60"| Escalate[Route to\nHuman]

You can estimate confidence through model self-critique, external verifier models, or classifier heads. Anything below threshold gets treated as “not safe to automate.”

Design checklist:

  • Define per-route confidence thresholds and escalation policies before launch, not after an incident
  • Add secondary triggers beyond confidence scores: negative sentiment spikes, repeated user rephrasing, explicit request for a human
  • Log confidence scores alongside outcomes to tune thresholds over time. Start conservative, loosen as you gather data
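
The routing logic itself is trivial; the hard part is the thresholds and the confidence estimate. A sketch with the cut-offs from the diagram (both values are starting points to tune, not recommendations):

```python
def route(action, confidence):
    """Map a confidence score to one of the three paths in the diagram above."""
    if confidence > 0.85:
        return ("auto_execute", action)
    if confidence >= 0.60:
        return ("confirm_with_user", action)   # middle band: ask before acting
    return ("escalate_to_human", action)
```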

7. Human Escalation Protocol

Human-in-the-loop (HITL) isn’t just “sometimes ask a human.” It’s a designed protocol for when, how, and where humans intervene.

Naive approaches escalate entire conversations when the AI gets stuck. Better approaches let humans answer targeted questions while the AI stays primary: one expert supports many concurrent sessions, responding to precise prompts instead of reading raw logs.

Trigger                            Escalation Type      Format
──────────────────────────────────────────────────────────────────────────────
Confidence below threshold         Targeted question    "Is this the correct account? Options: A, B, C"
High-risk topic (legal, medical)   Full handoff         Complete context summary + conversation history
Negative sentiment spike           Collaborative        Agent stays primary, human reviews in real time
User requests human                Immediate transfer   Warm handoff with context

The critical feedback loop: capture human responses as labeled data so the agent improves at those exact edge cases over time. Every escalation should make the next one less likely.

Architecture Decisions

These patterns shape how you structure the overall system.

8. Orchestration vs Choreography

In orchestration, a central controller drives the whole workflow. It calls tools and agents in order, handles branching, retries, and compensation. The control flow lives in one place.

In choreography, agents react to events and each other without a central brain. Behavior emerges from event subscriptions and message flows.

Dimension          Orchestration                        Choreography
──────────────────────────────────────────────────────────────────────────────
Control flow       Explicit, centralized                Emergent, distributed
Debugging          Follow the conductor                 Trace across N services
Failure handling   One place to catch errors            Each agent handles its own
Coupling           Conductor knows all agents           Agents only know events
Best for           Regulated, business-critical flows   Loosely coupled, event-driven tasks

For most production agent systems, start with orchestration. You need explicit governance and observability over every step, especially in financial operations, customer-facing flows, or anything with compliance requirements.

Reserve choreography for loosely coupled agents that discover and react to events (notifications, enrichment, personalization). But back it with strong distributed tracing and DLQs, because debugging emergent behavior across independent agents is genuinely hard.

9. LLM Gateway

An LLM gateway is a layer that all model traffic passes through. It abstracts over multiple providers while centralizing routing, auth, quotas, and observability.

%%{init: {"layout": "dagre"}}%%
flowchart TB
    A1[Agent 1] --> GW[LLM Gateway]
    A2[Agent 2] --> GW
    A3[Agent 3] --> GW

    GW --> |"routing\npolicy"| P1[OpenAI]
    GW --> P2[Anthropic]
    GW --> P3[Google]

    GW --> Trace[Trace Store]
    GW --> Budget[Cost Budget]
    GW --> Limits[Rate Limits]

Because the gateway sees every LLM call, it’s the natural place to enforce rate limits, cost budgets, circuit breakers, and audit policies without touching application code at each call site.

  • Use the gateway as the only way agents talk to models. No side-door credentials
  • Implement routing policies (cheaper models for simple tasks, low-latency models for real-time, fallbacks for outages) and log them as spans
  • Export traces via OpenTelemetry so LLM telemetry lines up with your existing infrastructure monitoring
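
A routing policy with fallback can be sketched as an ordered preference list per task class. Model names here are placeholders, and `providers` stands in for whatever client objects your gateway wraps:

```python
# Illustrative policy: cheaper model first for simple tasks, with a fallback
# chain for outages. Each failed attempt would be logged as a span.
POLICY = {
    "simple": ["cheap-model", "mid-model"],
    "complex": ["frontier-model", "mid-model"],
}

def call_with_fallback(task_class, providers, prompt):
    """providers maps model name -> callable; try each in policy order."""
    errors = []
    for model in POLICY[task_class]:
        try:
            return model, providers[model](prompt)
        except Exception as exc:
            errors.append((model, str(exc)))   # record and fall through
    raise RuntimeError(f"all providers failed: {errors}")
```

Because this logic lives in the gateway, adding a provider or changing the routing order never touches agent code.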

10. Semantic Caching

In production, a large fraction of queries are paraphrases of previous ones. Without caching, you recompute full retrieval and generation pipelines every time.

A semantic cache stores results keyed by meaning, not exact string matches. It sits at the front of the agent loop: if the user’s intent matches a previous one closely enough, return the cached answer instead of re-executing the whole chain.

Query: "What's the refund policy?"
Cache: HIT (similarity: 0.94, threshold: 0.90)
        ↳ matched: "How do I get a refund?"
        ↳ saved: ~2,100 tokens, ~1.8s latency

Case studies report up to 80% cost and latency reductions for high-frequency query patterns. But the threshold tuning matters. Too aggressive and you serve stale or wrong answers. Too conservative and you barely save anything.

Cache in layers with different TTLs:

Layer                        TTL                Example
──────────────────────────────────────────────────────────────
LLM responses                1-4 hours          Factual Q&A
Deterministic tool results   5-15 minutes       API lookups
Embeddings                   24-72 hours        Document vectors
Session state                Session duration   Conversation context
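
The lookup itself reduces to a nearest-neighbor search over stored embeddings. A minimal sketch with a linear scan and cosine similarity; the embedding vectors would come from your embedding model, and a real cache would use a vector index plus the TTLs above:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Return a cached answer when the query embedding is close enough."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, embedding):
        best, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = answer, sim
        # Below threshold means "miss": run the full pipeline instead.
        return best if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

The `threshold` parameter is exactly the knob discussed above: raising it trades cache hit rate for answer freshness.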

State Management

These patterns keep long-running and multi-agent systems coherent.

11. Context Window Checkpointing

Long-running agents eventually hit context limits. Without strategy, they either truncate important history or blow up with oversized prompts.

Context checkpointing periodically distills the current agent state into a compact summary, then continues with a fresh context window seeded from that checkpoint.

%%{init: {"layout": "dagre"}}%%
flowchart LR
    T1["Turns 1-20\n(raw history)"] --> CP1["Checkpoint 1\n(summary + facts)"]
    CP1 --> T2["Turns 21-40\n(raw history)"]
    T2 --> CP2["Checkpoint 2\n(summary + facts)"]
    CP2 --> T3["Turns 41+\n(current context)"]

Implementation approach:

  • Summarize every N turns or when token count nears 70-80% of the limit
  • Replace raw history with a model-generated summary plus key extracted facts (entity IDs, decisions made, constraints discovered)
  • Store durable state (task graph, external IDs, approval records) in a database, not prompt text. Prompts are volatile. Databases are not
  • Make checkpointing itself idempotent and observable so you can debug “lost context” bugs
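
The trigger-and-replace step can be sketched as below. `count_tokens` and `summarize` are stand-ins for a real tokenizer and an LLM summarization call; the 75% trigger matches the range suggested above:

```python
TOKEN_LIMIT = 8_000  # illustrative context budget

def maybe_checkpoint(history, count_tokens, summarize):
    """Replace raw history with a checkpoint once usage nears 75% of the limit."""
    if count_tokens(history) < 0.75 * TOKEN_LIMIT:
        return history                      # still room: keep raw turns
    summary = summarize(history)            # summary + key extracted facts
    return [{"role": "system", "content": f"Checkpoint: {summary}"}]
```

Durable facts (entity IDs, approvals, the task graph) should already live in a database before this runs, so the summary only has to preserve conversational context.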

12. Multi-Agent State Sync

As soon as you have multiple agents working on the same user, ticket, or dataset, you need a shared state model. Without one, you get conflicting actions and weird loops.

The fix: put the “world state” in a database or event log that agents read and write explicitly, rather than burying it inside each agent’s prompt history.

%%{init: {"layout": "dagre"}}%%
flowchart TB
    Agent1[Research Agent] --> |"read/write"| State[(Shared State\nDB + Event Log)]
    Agent2[Writing Agent] --> |"read/write"| State
    Agent3[Review Agent] --> |"read/write"| State
    State --> |"events"| Agent1
    State --> |"events"| Agent2
    State --> |"events"| Agent3

  • Use a single source of truth for shared entities (ticket, order, document) with optimistic locking or versioning
  • Have agents publish state changes as events that others consume, with idempotent handlers and DLQs for failed processing
  • Include state snapshots in traces so you can replay and debug multi-agent interactions after the fact
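
Optimistic locking, the first bullet above, can be sketched as version-checked writes. This uses an in-memory dict for the shared document; in production the version check would be a conditional update in the database:

```python
class ConflictError(Exception):
    pass

class SharedState:
    """Version-checked writes: a stale agent's update is rejected, not merged."""

    def __init__(self):
        self.doc = {"version": 0, "data": {}}

    def read(self):
        return dict(self.doc)              # snapshot includes the version

    def write(self, expected_version, updates):
        if self.doc["version"] != expected_version:
            raise ConflictError("stale read -- re-read and retry")
        self.doc["data"].update(updates)
        self.doc["version"] += 1
```

An agent that loses the race gets a `ConflictError`, re-reads the current state, and decides whether its update still makes sense, instead of silently clobbering another agent's work.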

13. Replanning Loop

Most agent frameworks model a perception-decide-act cycle. But production designs need to make the “decide” step explicit: reconsider objectives, check constraints, and potentially regenerate the plan.

This matters when tools fail, quotas get exhausted, or the user updates their request mid-flow.

Step 1: Search API ✓
Step 2: Analyze results ✓
Step 3: Call pricing API ✗ (timeout)

→ REPLAN triggered
  - Drop step 3 (pricing API down)
  - Substitute step 3b: Use cached pricing data
  - Continue from step 4

Step 3b: Load cached prices ✓
Step 4: Generate report ✓

Design checklist:

  • Represent plans as explicit structures (lists of steps, DAGs) that can be edited programmatically, not buried in natural language
  • Define triggers for replanning: repeated tool failures, significant context changes, hitting blast-radius limits
  • Gate replanning by severity. Minor failures get a local patch. Critical assumption breaks trigger full replanning or escalation
  • Log plan versions in traces so you can compare “planned vs executed” when debugging behavior
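
With plans as explicit structures, the local-patch case from the example above is a simple list edit. The step dicts and the `substitutes` map are illustrative shapes, not a framework API:

```python
def replan(plan, failed_step, substitutes):
    """Patch an explicit plan: swap a failed step for its fallback if one exists.

    plan is a list of step dicts; substitutes maps step name -> fallback step.
    """
    new_plan = []
    for step in plan:
        if step["name"] == failed_step:
            if failed_step not in substitutes:
                # No local patch available: trigger full replanning or escalate.
                raise RuntimeError(f"no fallback for {failed_step}")
            new_plan.append({**substitutes[failed_step], "replanned": True})
        else:
            new_plan.append(step)
    return new_plan
```

Returning a new plan rather than mutating in place makes it easy to log both versions and compare "planned vs executed" in traces.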

Operational Safety

These patterns handle deployment, monitoring, and ongoing reliability.

14. Canary Agent Deployment

Canary deployment rolls out a new agent or prompt to a small subset of traffic first, then gradually increases exposure while monitoring metrics.

Day 1:  1% traffic → new agent v2.1
        ↳ monitoring: error rate, hallucination rate, cost, latency
Day 2:  5% traffic (metrics look good)
Day 3:  25% traffic
Day 5:  100% traffic

The same pattern from microservice deployments works for agents and prompts: assign a fraction of users to the new version, compare quality metrics against the control group, and roll back instantly if hallucinations, safety incidents, or latency spike.

  • Route internal users or low-risk segments to new agents first
  • Define success/failure thresholds on quality, safety, and cost. Automate rollback when thresholds are violated
  • Use your gateway and observability traces to compare canary vs baseline behavior in detail
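
Traffic assignment should be sticky so a user sees the same agent version across requests. A common approach, sketched here with hypothetical version labels, is hash-based bucketing:

```python
import hashlib

def assign_version(user_id, canary_percent):
    """Sticky bucketing: hashing the user id keeps assignment stable across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2.1-canary" if bucket < canary_percent else "v2.0-stable"
```

Ramping the rollout is then just raising `canary_percent` (1 → 5 → 25 → 100), and rollback is setting it to 0; no user flips back and forth mid-session.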

15. Agentic Observability Tracing

Traditional logging captures final responses. Agentic observability captures every step: prompts, tool calls, retrievals, decisions, and outputs as linked spans in a trace.

Trace: user-request-abc123
├─ Span: llm-call (model: claude-4, tokens: 1,847, latency: 1,200ms)
├─ Span: tool-call:search (query: "refund policy", latency: 340ms)
├─ Span: tool-call:db-lookup (user_id: 42, latency: 12ms)
├─ Span: guardrail-check (policy: pii-filter, result: pass)
├─ Span: llm-call (model: claude-4, tokens: 892, latency: 680ms)
└─ Span: response (tokens: 234, total-latency: 2,450ms, cost: $0.03)

This level of tracing answers the questions that matter in production: which tool slowed this request, why a guardrail fired, how much a particular workflow costs, and where multi-agent interactions went off the rails.

  • Instrument agents so each LLM call, tool invocation, retrieval, and guardrail check is a span with correlation IDs
  • Export traces via OpenTelemetry to unify agent monitoring with your existing infrastructure stack
  • Build dashboards for latency, cost, error rate, and quality per route and agent version. Wire alerts to DLQs and circuit breakers

The Patterns Interlock

These 15 patterns aren’t isolated techniques. They form a layered defense system:

┌──────────────────────────────────────────────────────────┐
│  Operations: Canary Deployment + Observability           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │  Architecture: Gateway + Orchestration + Cache      │ │
│  │  ┌────────────────────────────────────────────────┐ │ │
│  │  │  State: Checkpointing + Sync + Replanning      │ │ │
│  │  │  ┌───────────────────────────────────────────┐ │ │ │
│  │  │  │  Containment: Limits + Gates + HITL       │ │ │ │
│  │  │  │  ┌──────────────────────────────────────┐ │ │ │ │
│  │  │  │  │  Resilience: CB + Timeouts + DLQ     │ │ │ │ │
│  │  │  │  └──────────────────────────────────────┘ │ │ │ │
│  │  │  └───────────────────────────────────────────┘ │ │ │
│  │  └────────────────────────────────────────────────┘ │ │
│  └─────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘

The resilience stack (circuit breakers + timeouts + idempotency + DLQs) forms the core. Containment wraps around it. Architecture provides the structural framework. Operations gives you the visibility and deployment safety to evolve everything else.

You don’t need all 15 on day one. But you need to know where each one goes, because the gap between “agent that works” and “agent that survives production” is exactly these patterns.

Start with three: observability tracing (you can’t fix what you can’t see), blast radius limiters (cap the damage), and circuit breakers (stop cascading failures). Build outward from there as your agent system grows in complexity and traffic.


Building production agent systems? I’d love to hear which patterns saved you (or which ones you learned about the hard way). Reach out on LinkedIn.