Harness Engineering: Stop Blaming the Model, Fix the Environment
I’ve been deploying AI agents for over a year. Claude Code for daily engineering work. Custom agents for research synthesis and content workflows. Every time one failed, the same reaction: “The model isn’t good enough. Maybe Opus will fix this. Maybe the next release.”
The coding agent would declare “done” on features that broke in production. The research agent would summarize articles it hadn’t fully read. The customer support agent would hallucinate policy details that didn’t exist. The content agent would lose all context between sessions and redo work from scratch.
Every single time, I blamed the model.
Then I found Learn Harness Engineering, an open-source course by WalkingLabs that synthesizes research from both OpenAI and Anthropic into a structured 12-lecture curriculum with hands-on projects.
It explained every failure mode I’d been hitting. Not as model limitations. As environment defects.
The core thesis is uncomfortable: when your agent fails, the problem is almost never the model. It’s everything around it. This applies whether your agent writes code, handles support tickets, conducts research, or orchestrates multi-step business workflows.
The Experiments That Ended the Debate
Anthropic ran a controlled experiment that should stop every “just upgrade the model” conversation. Same prompt (“build a 2D retro game maker”), same model (Opus 4.5):
| Setup | Runtime | Cost | Result |
|---|---|---|---|
| Bare run, no support | 20 min | $9 | Core features broken |
| Full harness (planner + generator + evaluator) | 6 hours | $200 | Fully playable game |
They didn’t change the model. They changed the harness.
OpenAI ran an even more extreme version. They used Codex to build a complete internal product from an empty git repository. Five months, roughly one million lines of code, ~1,500 PRs merged. Three engineers driving the whole thing. The hard constraint: humans never write code directly.
Early progress was slower than expected. Not because Codex lacked capability, but because the environment wasn’t complete enough.
The engineers’ job became: breaking large goals into small building blocks, ensuring the agent had sufficient tools and abstractions, then using completed blocks to unlock more complex tasks. When something failed, the fix was never “try harder.” It was always “what capability is the agent missing, and how do we make it both understandable and executable?”
This principle extends far beyond coding. A customer service agent that keeps escalating solvable tickets? It’s not a model problem. It’s missing access to the knowledge base, policy documents, or decision trees it needs. A research agent that hallucinates citations? It lacks proper retrieval tools and verification steps. A sales agent that gives incorrect pricing? The pricing data isn’t in its context.
This is harness engineering. Everything in the operating infrastructure outside the model weights. If it’s not model weights, it’s harness.
I’ve written about how Claude Code works under the hood before. The architecture is sound. But architecture without a proper operating environment is a thoroughbred without a saddle.
Five Layers Where Things Actually Break
The course maps agent failures to five distinct layers in Lecture 01. After reading all 12 lectures, I realized every frustration I’d had fit cleanly into one of these. These layers apply to any agent, not just coding ones.
1. Task specification. You say “add a search feature” to a coding agent. It hears something completely different. You say “handle refund requests” to a support agent. Handle how? Full refund? Partial? Store credit? What’s the threshold?
You didn’t specify, so the agent guesses. A correct guess is luck. A wrong one costs more to fix than being specific would have cost upfront.
2. Context provision. Your team standardized on SQLAlchemy 2.0 syntax, but that rule only exists in a Slack message from three months ago. Your refund policy changed last quarter, but the knowledge base still has the old version. Your research agent needs to cite only peer-reviewed sources, but nobody told it that.
The agent literally cannot see rules that aren’t in its operating context. Information not explicitly provided does not exist.
3. Execution environment. For coding agents: incomplete dev environment, missing dependencies, wrong tool versions. For other agents: broken API connections, expired credentials, inaccessible databases, rate-limited tools.
The agent burns precious context on infrastructure failures instead of solving your actual problem.
4. Verification feedback. No tests, no quality checks, or verification never communicated. The coding agent writes code and decides “it’s fine.” The support agent sends a response without checking policy compliance. The research agent synthesizes findings without verifying sources.
Guo et al.'s 2017 ICML paper on calibration showed that modern neural networks are systematically overconfident. Your agent is no exception. It’s a student grading their own exam.
5. State management. Long tasks spanning sessions lose everything. Every new session re-explores from scratch.
Anthropic’s data shows failure rates spike sharply on tasks exceeding 30 minutes without state persistence. This is equally brutal for a multi-day research synthesis agent that forgets which papers it already analyzed.
The Five-Subsystem Framework
The course presents a kitchen analogy in Lecture 02 for what a complete harness looks like. Five functional areas, each with a clear responsibility:
Instructions (recipe shelf) → System prompts, policy docs, decision trees
Tools (knife rack) → APIs, databases, retrieval systems, executors
Environment (stove) → Reproducible infrastructure, locked configs
State (prep station) → Progress tracking, session handoff, checkpoints
Feedback (quality check) → Verification loops, quality gates, eval criteria
For coding agents specifically, this maps to CLAUDE.md, shell access, Docker, PROGRESS.md, and test suites.
For a customer support agent: tone guides and escalation rules, CRM and knowledge base access, consistent API configs, ticket state tracking, and CSAT scoring.
For a research agent: methodology constraints, search and retrieval tools, source databases, synthesis progress files, and citation verification.
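If it helps to see the checklist as something you can actually inspect, here’s a small sketch (mine, not the course’s) of a harness manifest in Python, filled in with the support-agent mapping above. Any empty subsystem is the first gap to close.

```python
# Not from the course: a minimal manifest that makes the five-subsystem
# checklist machine-readable, using the support-agent mapping above.
from dataclasses import dataclass, field

@dataclass
class HarnessSpec:
    instructions: list[str] = field(default_factory=list)  # recipe shelf
    tools: list[str] = field(default_factory=list)         # knife rack
    environment: list[str] = field(default_factory=list)   # stove
    state: list[str] = field(default_factory=list)         # prep station
    feedback: list[str] = field(default_factory=list)      # quality check

support_harness = HarnessSpec(
    instructions=["tone_guide.md", "escalation_rules.md"],
    tools=["crm_api", "knowledge_base_search"],
    environment=["pinned API configs"],
    state=["ticket_state.json"],
    feedback=["policy_compliance_check", "csat_scoring"],
)

# Any empty subsystem is the first place to look when the agent fails.
gaps = [name for name, items in vars(support_harness).items() if not items]
print("missing subsystems:", gaps or "none")
```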
A team using GPT-4o on a TypeScript + React frontend app went through four stages, adding subsystems one at a time:
| Stage | What was added | Success rate |
|---|---|---|
| Empty kitchen | Basic README only | 20% |
| Recipe shelf installed | AGENTS.md with stack + conventions | 60% |
| Quality check opened | Verification commands listed | 80% |
| Prep station ready | Progress file templates | 80-100% |
The model never changed. Success rate went from 20% to near 100%.
The same progression applies to any agent type. A support agent with just a system prompt performs far worse than one with policy documents, escalation rules, verification steps, and state tracking layered in incrementally.
The course recommends “isometric model control” to find your bottleneck: keep the model fixed, remove one subsystem at a time, measure which removal causes the biggest performance drop. That’s where you focus first.
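The course names the principle without prescribing tooling. Assuming you have a repeatable benchmark (a fixed task set scored by pass rate), the loop is easy to sketch; `run_eval` below is a toy stand-in for that benchmark.

```python
# A minimal sketch of the ablation loop: hold the model fixed, drop one
# subsystem at a time, and see which removal hurts most.

def run_eval(harness: dict[str, bool]) -> float:
    """Stand-in benchmark: replace with your real task suite scored by pass rate."""
    weights = {"instructions": 0.25, "tools": 0.20, "environment": 0.15,
               "state": 0.15, "feedback": 0.25}
    return sum(w for name, w in weights.items() if harness.get(name))

SUBSYSTEMS = ("instructions", "tools", "environment", "state", "feedback")
full = {name: True for name in SUBSYSTEMS}
baseline = run_eval(full)

drops = {}
for subsystem in SUBSYSTEMS:
    ablated = {**full, subsystem: False}        # same model, one subsystem removed
    drops[subsystem] = baseline - run_eval(ablated)

bottleneck = max(drops, key=drops.get)
print(f"biggest drop when removing: {bottleneck} ({drops[bottleneck]:.2f})")
```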
The Knowledge Base Is the Only Truth
Lecture 03 makes a point that hit hard: anything the agent can’t access in its knowledge base, for all practical purposes, doesn’t exist.
For coding agents, this means the repository. For support agents, the policy documents and CRM data. For research agents, the source databases and methodology guides. For sales agents, the pricing sheets and competitive intelligence.
The principle is universal: your agent is locked inside whatever information you’ve made available to it. Everything outside that boundary, it knows nothing about.
Your Slack history, Jira tickets, Confluence pages, the policy change discussed in last week’s all-hands. The agent can’t access any of it. It can’t “go ask someone.” Everything outside its accessible context is a blind spot.
The course introduces a “cold-start test” to evaluate completeness. Open a fresh agent session with only its available context and see if it can answer five questions:
- What is this system (or domain)?
- How is it organized?
- How do I operate (run, respond, research)?
- How do I verify correctness?
- Where is the project (or workflow) right now?
If it can’t answer all five, your map has blank spots. Where the map is blank, the agent guesses. Wrong guesses become bugs, bad responses, or hallucinated information.
A team maintaining ~30 microservices had architecture decisions scattered across Confluence, Slack, and senior engineers’ heads. After introducing AI agents, 70% of tasks required human intervention. Nearly every failure involved the agent violating some “everyone knows but nobody wrote down” constraint.
After moving critical decisions into repo files (AGENTS.md, module-level ARCHITECTURE.md, explicit CONSTRAINTS.md), the same agent could answer all key questions on cold start.
The same pattern plays out in non-coding domains. A support team found their agent giving outdated refund policies because the updated rules lived in a Google Doc nobody connected to the agent’s context. A research workflow kept missing key sources because inclusion criteria existed only in a team lead’s head.
This principle extends to design decisions too. I explored a similar idea in DESIGN.md for consistent AI-generated UI: if the agent can’t see your constraints, it invents its own.
Don’t Write a Giant Instruction File
Here’s a trap I fell into myself. I’ve written before about treating docs as infrastructure for Claude Code. But there’s a dark side.
You start adding rules every time the agent makes a mistake. For coding agents, your CLAUDE.md grows from 100 lines to 600. For support agents, your system prompt bloats with edge cases. For research agents, your methodology doc becomes a dissertation. Then the agent actually gets worse.
The “Lost in the Middle” paper (Liu et al., 2023) showed that LLMs use information buried in the middle of long contexts far less reliably than information at the beginning or end. A critical constraint sitting at line 300 of a 600-line instruction file has a very high chance of being ignored.
Lecture 04 covers this failure mode in detail.
The solution: progressive disclosure. Keep your entry instructions at 50-100 lines with just the domain overview, core operations, global hard constraints (no more than 15), and links to topic-specific documents. Each topic document is 50-150 lines, organized by subject. The agent only loads them when relevant.
This applies universally. A coding agent’s AGENTS.md links to separate files for API conventions, testing standards, and deployment rules. A support agent’s system prompt links to escalation policies, refund procedures, and product-specific FAQs loaded on demand. A research agent’s instructions link to methodology guides and source evaluation criteria pulled in per task.
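To keep the budgets honest over time, you can lint them. Here’s a sketch under a few assumptions of my own: the entry file is AGENTS.md, topic docs live in docs/agent/, and hard constraints use a "- MUST" prefix. The line limits mirror the guidance above.

```python
# A sketch of a lint check for progressive disclosure. The budgets (100-line
# entry file, 15 global hard constraints, 150-line topic docs) mirror the
# guidance above; the file layout and "- MUST" prefix are assumptions.
from pathlib import Path

ENTRY_FILE = Path("AGENTS.md")        # the agent's entry instructions
TOPIC_DIR = Path("docs/agent")        # topic docs loaded on demand
MAX_ENTRY_LINES, MAX_TOPIC_LINES, MAX_HARD_CONSTRAINTS = 100, 150, 15

if not ENTRY_FILE.exists():
    raise SystemExit(f"{ENTRY_FILE} missing: the agent has no entry point")

entry_lines = ENTRY_FILE.read_text(encoding="utf-8").splitlines()
problems = []

if len(entry_lines) > MAX_ENTRY_LINES:
    problems.append(f"{ENTRY_FILE}: {len(entry_lines)} lines; move detail into topic docs")

hard_constraints = [l for l in entry_lines if l.startswith("- MUST")]
if len(hard_constraints) > MAX_HARD_CONSTRAINTS:
    problems.append(f"{len(hard_constraints)} global hard constraints; keep under {MAX_HARD_CONSTRAINTS}")

for doc in TOPIC_DIR.glob("*.md"):
    doc_len = len(doc.read_text(encoding="utf-8").splitlines())
    if doc_len > MAX_TOPIC_LINES:
        problems.append(f"{doc}: {doc_len} lines; split by topic")

print("\n".join(problems) or "instruction files within budget")
```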
A SaaS team trimmed their 600-line AGENTS.md to 80 lines with linked topic documents. Same task set, success rate went from 45% to 72%. Security constraint compliance went from 60% to 95%.
Initialize Before Working
Lecture 06 introduced a pattern I now use every time. The first session on any new agent workflow does only initialization. No actual work output. Just infrastructure:
- Runnable environment (all tools connected, no errors)
- Verifiable quality gate (at least one example check passing)
- Bootstrap contract document (tells next session how to operate and continue)
- Task breakdown (ordered list with acceptance criteria per task)
- Clean checkpoint (stable state to start from)
For coding agents, this means dependencies installed, tests passing, and a clean commit.
For support agents: knowledge base connected, escalation paths verified, sample tickets processed correctly.
For research agents: retrieval tools working, source access confirmed, methodology documented, sample queries validated.
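To make the coding-agent case concrete, a minimal initialization gate might look like this. The commands and file names (pip, pytest, git, AGENTS.md, tasks.json) are assumptions about the stack; the five checks mirror the list above.

```python
# A minimal "initialization gate": the first session is not done until all
# five checks pass. Commands and file names are assumptions about the stack.
import subprocess
from pathlib import Path

def sh(cmd: str) -> bool:
    """Run a shell command and report whether it exited cleanly."""
    return subprocess.run(cmd, shell=True, capture_output=True).returncode == 0

checks = {
    "runnable environment": sh("pip check"),
    "quality gate passes":  sh("pytest -q"),              # at least one example check
    "bootstrap contract":   Path("AGENTS.md").exists(),   # tells the next session how to operate
    "task breakdown":       Path("tasks.json").exists(),  # ordered list with acceptance criteria
    "clean checkpoint":     sh("git diff --quiet") and sh("git diff --cached --quiet"),
}

for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")

if not all(checks.values()):
    raise SystemExit("Initialization incomplete: do not start feature work yet.")
```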
Anthropic’s data: projects using a dedicated initialization phase showed 31% higher feature completion rates in multi-session scenarios. The time invested in initialization is fully recovered in the next 3-4 sessions.
WIP=1: Do Less, Finish More
This was the most counterintuitive insight from Lecture 07. Anthropic’s data shows agents using a “small next step” strategy (WIP=1) have a 37% higher task completion rate than agents given broad prompts.
More surprising: output volume is weakly negatively correlated with tasks actually completed. More output produced, fewer working results.
The failure is familiar across every agent type. A UC San Diego study found that experienced developers succeed precisely because they constrain scope and verify every step. Agents need the same discipline externally imposed.
You tell Claude Code to “add user authentication” and it modifies the database schema, writes routes, changes frontend components, and refactors error handling. Two hours later: nothing works end-to-end.
You tell a research agent to “analyze the competitive landscape” and it tries to cover pricing, features, market positioning, customer sentiment, and future roadmaps simultaneously. Result: shallow analysis across everything, depth on nothing.
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    Queue["Task queue"] --> Pick["Pick exactly one task"]
    Pick --> Active["Only one active item"]
    Active --> Verify["Run verification"]
    Verify -->|pass| Commit["Mark complete, unlock next"]
    Verify -->|fail| Active
    Commit --> Queue
```
A REST API project comparison:
| Strategy | Code produced | Files touched | E2E pass rate | Features completed (3 sessions) |
|---|---|---|---|---|
| Unconstrained (5 features at once) | ~800 lines | 12 files | 20% | 3 of 8 |
| WIP=1 (one feature at a time) | ~200 lines per feature | 4 files per feature | 100% per feature | 7 of 8 |
The fix in your harness:
```markdown
## Work Rules

- Work on one task at a time
- Only start the next after current passes end-to-end verification
- Don't "also handle" task B while completing task A
```
Task Lists as the Backbone
The course treats task lists not as planning tools but as harness primitives. Every task item needs a triple: behavior description, verification method, and current state.
The harness uses this file for scheduling (pick next not_started item), verification (run the check), handoff (auto-generate session summaries), and progress tracking.
For a coding agent:
```json
{
  "id": "F03",
  "behavior": "POST /api/cart/items with {product_id, quantity} returns 201",
  "verification": "curl -s -o /dev/null -w '%{http_code}' -X POST localhost:3000/api/cart/items -H 'Content-Type: application/json' -d '{\"product_id\":1,\"quantity\":2}'",
  "state": "passing",
  "evidence": "commit abc123"
}
```
For a support agent:
```json
{
  "id": "S07",
  "behavior": "Refund requests under $50 resolved without escalation",
  "verification": "Response cites policy section, amount within threshold, no escalation flag",
  "state": "passing",
  "evidence": "ticket #4521 resolved correctly"
}
```
For a research agent:
```json
{
  "id": "R02",
  "behavior": "Literature review covers all papers from 2023-2026 in target journals",
  "verification": "Cross-reference extracted papers against journal TOCs for date range",
  "state": "in_progress",
  "evidence": "12 of 18 journals processed"
}
```
The key constraint: the agent cannot mark a task as “passing” by itself. Only the verification method executing successfully triggers the state transition. This eliminates the overconfidence problem at the system level.
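Here’s a sketch of what that looks like mechanically, combined with WIP=1: pick exactly one not_started item, run its verification command, and only flip the state when the check passes. The tasks.json file name and shell-command verifiers are my assumptions; the schema mirrors the triples above.

```python
# A sketch of verification-gated scheduling with WIP=1. The agent never
# flips "state" itself; only a passing verification command does.
import json
import subprocess
from pathlib import Path

TASKS = Path("tasks.json")

def verify(command: str) -> bool:
    return subprocess.run(command, shell=True, capture_output=True).returncode == 0

tasks = json.loads(TASKS.read_text(encoding="utf-8"))

active = next((t for t in tasks if t["state"] == "not_started"), None)  # WIP=1: exactly one item
if active is None:
    print("queue empty: nothing to schedule")
else:
    # ... the agent works on exactly this task, nothing else ...
    if verify(active["verification"]):
        active["state"] = "passing"
        TASKS.write_text(json.dumps(tasks, indent=2), encoding="utf-8")
        print(f"{active['id']} verified; next item unlocked")
    else:
        print(f"{active['id']} still failing; do not start another task")
```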
The Premature Victory Problem
The course dedicates two full lectures to this (Lecture 09 and Lecture 10). Neural networks are systematically overconfident (Guo et al., ICML 2017). Ask an agent to evaluate its own work and it will consistently rate it too favorably.
The same model generating and evaluating inherently favors being generous to itself.
This isn’t just a coding problem. A support agent declares a ticket “resolved” when it sends a response, not when the customer’s issue is actually fixed. A research agent marks a section “complete” after writing a summary, not after verifying that the sources support the claims. A sales agent reports a lead as “qualified” based on surface-level criteria without confirming budget and timeline.
The three-layer termination check (generalized beyond code):
- Layer 1: Format validation. Output meets structural requirements. Correct schema, required fields present, constraints satisfied.
- Layer 2: Functional verification. Output actually works. Tests pass, responses are accurate, actions complete successfully.
- Layer 3: System-level confirmation. End-to-end validation. The user’s actual problem is solved, not just the intermediate step.
For coding agents, this maps to lint/typecheck, test execution, and E2E testing. For support agents: response format is correct, policy compliance verified, customer confirmation received. For research agents: citations formatted, sources verified against claims, synthesis reviewed against methodology.
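For the coding-agent mapping, the whole gate can be a short script. The specific commands (ruff, mypy, pytest) are examples of the three layers, not a prescription from the course.

```python
# A sketch of the three-layer termination gate: "done" requires every layer,
# in order, to pass. Commands are illustrative; swap in your own checks.
import subprocess

def passes(cmd: str) -> bool:
    return subprocess.run(cmd, shell=True, capture_output=True).returncode == 0

layers = [
    ("format validation",         "ruff check . && mypy ."),  # structure: lint + typecheck
    ("functional verification",   "pytest -q"),               # it actually works: tests
    ("system-level confirmation", "pytest -q tests/e2e"),     # the user's problem is solved: E2E
]

for name, cmd in layers:
    if not passes(cmd):
        raise SystemExit(f"NOT DONE: {name} failed ({cmd}); fix before declaring completion")

print("All three layers pass; the task may be marked complete.")
```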
The course gives a clear example from a coding context: an Electron app’s file export feature. Renderer tests passed (file operations mocked). Preload tests passed (filesystem mocked). Service tests passed (data source mocked). Five cross-component defects found only by end-to-end testing. Interface mismatches, state propagation errors, resource leaks. Unit tests caught zero.
One more critical tactical pattern from OpenAI: error messages for agents must include fix instructions.
Not "Test failed" but "Test failed: POST /api/reset-password returned 500. Check that email service config exists in environment variables. Template file should be at templates/reset-email.html."
Not "Response rejected" but "Response rejected: refund amount exceeds $50 threshold. Escalate to tier-2 or request manager approval per policy doc section 3.2."
This turns failures into a self-correcting loop.
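One way to standardize the pattern (sketched here, not taken from OpenAI’s implementation) is to make the harness emit structured errors that always carry a likely cause and a next action:

```python
# A sketch of errors that carry fix instructions: every failure reported back
# to the agent names what broke, the likely cause, and the next concrete action.
from dataclasses import dataclass

@dataclass
class AgentError:
    what_failed: str
    likely_cause: str
    next_action: str

    def render(self) -> str:
        return (f"FAILED: {self.what_failed}\n"
                f"Likely cause: {self.likely_cause}\n"
                f"Fix: {self.next_action}")

err = AgentError(
    what_failed="POST /api/reset-password returned 500",
    likely_cause="email service config missing from environment variables",
    next_action="add the email service config and confirm templates/reset-email.html exists",
)
print(err.render())  # fed back to the agent instead of a bare "Test failed"
```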
The Amnesiac Craftsman
For long-running tasks, Lecture 05 recommends treating the agent like a brilliant specialist with amnesia.
Whether it’s writing code, researching a topic, or handling a complex multi-step workflow, the problem is identical: each new session starts from zero unless you build in continuity. Before each session ends, it writes:
```markdown
# PROGRESS.md

## Current State
- Latest commit: abc1234 (feat: add user preferences endpoint)
- Test status: 42/43 passing (test_pagination_edge_case failing)

## Completed
- [x] User model and database migration
- [x] Basic CRUD endpoints

## Known Issues
- test_pagination_edge_case returns 500 on empty result sets

## Next Steps
1. Fix pagination edge case bug
2. Add "include deleted users" query parameter
```
Measured impact: rebuild time reduced 78%, feature completion rate from 58% to 100%, hidden defect rate from 43% to 8%.
Anthropic also discovered “context anxiety.” When agents sense context is running low, they rush to finish, skip verification, and choose simple solutions over optimal ones.
On Sonnet 4.5, this behavior is severe enough that context reset (starting a fresh session with handoff files) outperforms context compaction. On Opus 4.5, compaction works fine.
Harness design needs to account for the specific model, not follow a one-size-fits-all template.
Clean Handoff: Entropy Is the Default
Without active cleanup, agent-managed workflows degrade fast. Lecture 12 cites data from a 12-week project:
| Metric | Without cleanup strategy | With cleanup strategy |
|---|---|---|
| Build pass rate (week 12) | 68% | 97% |
| Test pass rate (week 12) | 61% | 95% |
| New session startup time | 60+ min | 9 min |
| Stale artifacts | 103 | 11 |
Every session must leave a clean state across five dimensions: build passes, tests pass, progress recorded, no stale artifacts, standard startup path works.
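A session-exit gate for those five dimensions can be small. In this sketch the npm commands and the seven-day staleness heuristic are my assumptions; the five checks themselves come from the lecture.

```python
# A sketch of a session-exit gate for the five clean-handoff dimensions.
import subprocess
import time
from pathlib import Path

def sh(cmd: str) -> bool:
    return subprocess.run(cmd, shell=True, capture_output=True).returncode == 0

week_ago = time.time() - 7 * 24 * 3600  # assumed staleness threshold
stale = [p for p in Path("scratch").glob("*") if p.stat().st_mtime < week_ago]

exit_checks = {
    "build passes":           sh("npm run build"),
    "tests pass":             sh("npm test"),
    "progress recorded":      Path("PROGRESS.md").exists(),
    "no stale artifacts":     not stale,
    "standard startup works": sh("npm run smoke"),  # replace with your documented startup path
}

failed = [name for name, ok in exit_checks.items() if not ok]
if failed:
    raise SystemExit("Unclean handoff: " + ", ".join(failed))
print("Session ends in a clean state.")
```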
The course also makes a non-obvious point: harness components should be periodically simplified. As models improve, constraints that were essential three months ago may be unnecessary overhead now.
Monthly, pick one component, disable it, run benchmarks. If results don’t degrade, remove it permanently.
Observability Changes Everything
Lecture 11 covers making agent runtime observable. Without observability, retries are blind guesses. The agent doesn’t know why something failed, so it tries repeatedly in wrong directions.
This is equally true for a coding agent hitting a test failure, a support agent getting rejected responses, or a research agent producing low-quality synthesis.
The fix: sprint contracts (written scope + verification standards before work starts) and evaluator rubrics (structured scoring instead of “does it look right?”).
This connects to what I covered in 15 patterns for production AI agents. Observability isn’t optional.
A team comparison: without observability, 45 minutes and 3-4 blind retries for a barely acceptable result. With full observability, 15 minutes and one iteration for high-quality output. 3x efficiency difference.
What I Actually Changed
After working through this material, I made specific changes to how I deploy agents across my workflows:
1. Verification commands in every agent’s harness. Not “check if it works” but the exact verification steps, in order, with pass criteria. For coding agents: test commands in CLAUDE.md. For research agents: source verification checklists. For content agents: quality rubrics with specific scoring. This is the highest-ROI single change. If I could only do one thing, this is it.
2. WIP=1 as a hard rule in task instructions. “Complete only this task. Do not touch anything else. Run end-to-end verification. Do not declare done until it passes.” Works for coding features, research sections, support ticket categories. The quality improvement was immediate and dramatic.
3. Progress files for multi-session work. Before ending any session that isn’t fully complete, the agent updates a progress file with current state, decisions made, known issues, and next steps. The next session picks up in 3 minutes instead of 15. This applies to any multi-session agent workflow, not just code.
4. Initialization as a separate first session. For new agent deployments, the first session only sets up infrastructure. No actual output. Environment running, quality gates configured, task list created, bootstrap contract documented. Every subsequent session is faster because the foundation is solid.
5. Dedicated task lists with verification commands. Not prose descriptions. Structured entries with expected behavior, check command (or evaluation criteria), and state. The agent cannot declare “passing” without the verification step succeeding.
The Bottom Line
The harness engineering mental model is deceptively simple. When things fail, check the environment before blaming the model. Five subsystems determine how much model capability gets realized. A well-structured operating context might be more effective than upgrading to a more expensive model.
This applies whether you’re building coding agents, customer support bots, research assistants, sales automation, or any workflow where an LLM operates with autonomy. The failure modes are identical. The fixes are structural, not model-dependent.
The Learn Harness Engineering course is free, open-source, available in 12 languages, and backed by research from OpenAI and Anthropic. It has 12 lectures (each answering one core question), 6 hands-on projects building a real Electron app, and a resource library with copy-ready templates.
If you’re deploying AI agents in any capacity, start with Lecture 01 and work through the curriculum. The investment is a few hours of reading. The return is fundamentally more reliable agent output.
The strongest model in the world still fails without a proper environment around it. Fix the harness. Then the model works.
Deploying AI agents in your workflows? I’d love to hear what harness patterns are working for your team and where agents still break down, whether that’s coding, support, research, or something else entirely. Reach out on LinkedIn.