This post is based on the following video:
https://youtube.com/watch?v=Xxuxg8PcBvc
TL;DR
- An AI agent is not just a model; it's Agent = Model + Harness, where the harness is every bit of scaffolding around the weights (system prompts, tool contracts, memory, verification loops)
- The harness is where most real-world performance comes from now; a good harness on an older model often beats a bare frontier model
- Mature harness work is a craft of subtraction — the goal is the minimum scaffolding that keeps the model on rails, not the maximum
- The same idea shows up as a folder structure (Claude Code setups) or as an engineering discipline (IBM's 7 skills for agent builders) — the abstraction is identical even when the tooling looks different
- If you're building for yourself, start with a filesystem-based harness; the moment another human or adversarial input enters the picture, layer on reliability and security engineering
The dominant narrative is wrong about where agent performance comes from
For most of 2024 and 2025, the default story about AI agents went like this: agents are bottlenecked by model capability, so the path to better agents is better models. Wait for GPT-5. Wait for Claude Opus 4.x. Wait for the next frontier.
That narrative is collapsing. The people actually shipping production agents have converged on a different view: the bottleneck isn't the model, it's everything around the model. The field has a name for this now — harness engineering — and it's quietly become the most important skill in applied AI.
This post pulls together three videos that describe the same shift from different angles: a theoretical framing of what a harness is (PY's Rethinking AI Agents), a concrete filesystem-based implementation (Simon Scrapes' Claude Code Setup), and a production-engineering checklist (IBM's 7 Skills You Need to Build AI Agents). They disagree about the details in interesting ways, but they all point at the same underlying insight.
Agent = Model + Harness
The core definition is deceptively simple:
Agent = Model + Harness
The model is just the weights. The harness is everything else: the system prompt, tool definitions, orchestration logic, memory management, verification loops, retry policies, safety guardrails, output validators. If you took the model out and dropped in a different one, the harness is what stays.
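To make the "harness is what stays" point concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than any real framework's API: a model is treated as a plain function from prompt to text, and the names `Harness`, `Agent`, `non_empty`, and `swap_model` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Assumption: a "model" is any function from prompt text to completion text.
Model = Callable[[str], str]


def non_empty(output: str) -> bool:
    """Default verifier: accept any output that is not blank."""
    return bool(output.strip())


@dataclass(frozen=True)
class Harness:
    system_prompt: str
    max_attempts: int = 3
    verifier: Callable[[str], bool] = non_empty


@dataclass(frozen=True)
class Agent:
    model: Model
    harness: Harness

    def run(self, task: str) -> str:
        prompt = f"{self.harness.system_prompt}\n\nTask: {task}"
        # Verification loop: re-ask the model until the verifier accepts.
        for _ in range(self.harness.max_attempts):
            output = self.model(prompt)
            if self.harness.verifier(output):
                return output
        raise RuntimeError("output failed verification on every attempt")


def swap_model(agent: Agent, new_model: Model) -> Agent:
    """Replace the weights; the harness carries over unchanged."""
    return replace(agent, model=new_model)
```

In this sketch, upgrading the model is a one-field change, while the system prompt, retry budget, and verifier are reused as-is.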
This definition matters because it reframes where performance comes from. Anthropic's engineering team put it plainly in their recent post on harness design: every component in a harness encodes an assumption about what the model can't do on its own. A harness full of retry loops, planner/critic splits, and verification checks is telling you — very loudly — what the model is bad at. As models improve, some of those assumptions go stale, and parts of the harness become dead weight.
flowchart LR
subgraph Harness
SP[System prompt]
TD[Tool definitions]
MEM[Memory & state]
VER[Verification loops]
MODEL[Model weights]
end
USER[User intent] --> SP
SP --> MODEL
TD --> MODEL
MEM --> MODEL
MODEL --> VER
VER -->|retry| MODEL
VER -->|approved| OUT[Output]
Notice that the model sits inside the harness, not beside it. The harness is the environment the model inhabits, not a wrapper on top of it.
The two failure modes every harness has to solve
PY names two failure modes that every agent engineer will recognize immediately:
Oneshotting is when the agent tries to do everything in a single turn. You ask it to build a feature, and it writes the spec, the code, the tests, and the migration all at once — and because it's trying to do too much, each piece is worse than if it had done them one at a time. This is the model's default behavior when the harness doesn't impose structure.
Premature completion is when the agent declares victory at 60% done. It writes the happy path, runs it once, and announces the task is complete. The edge cases, the error handling, the follow-up checks — all skipped, because nothing in the harness forced the agent to keep going.
Both of these are harness problems, not model problems. A smarter model on a bad harness will still oneshot and still declare premature victory — it'll just produce slightly more polished garbage. The fix is structural: decompose the task into phases the harness enforces, require verification before the agent can mark work complete, externalize progress to a file the agent has to read and update between steps.
This matches exactly what Anthropic documents in their own harness work. They solve premature completion by having an initializer agent write out a comprehensive feature-requirements file at the start, which every subsequent agent has to read before it can claim completion. The file is the harness's enforcement mechanism — the agent can't fake progress against a written spec.
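A rough sketch of that enforcement mechanism, under stated assumptions: the requirements file is JSON with one record per feature and a `passes` flag, and the function names are hypothetical. It is not Anthropic's implementation, only one way a harness could refuse a completion claim while anything on the written spec is unverified.

```python
import json
from pathlib import Path


def init_requirements(path: Path, features: list[str]) -> None:
    """Initializer step: record every required feature as not yet passing."""
    records = [{"feature": name, "passes": False} for name in features]
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def mark_passing(path: Path, feature: str) -> None:
    """Flip one feature to passing, e.g. after its verification check succeeds."""
    records = json.loads(path.read_text(encoding="utf-8"))
    matched = False
    for record in records:
        if record["feature"] == feature:
            record["passes"] = True
            matched = True
    if not matched:
        raise KeyError(f"unknown feature: {feature}")
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def remaining(path: Path) -> list[str]:
    """Features still unverified, in the order the initializer wrote them."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [r["feature"] for r in records if not r["passes"]]


def claim_completion(path: Path) -> None:
    """Harness-side gate: refuse completion while any feature is unverified."""
    todo = remaining(path)
    if todo:
        raise RuntimeError(f"not done, still unverified: {todo}")
```

The design choice is that completion is decided by the file, not by the agent's own judgment of its work.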
Harness as directory structure
Here's where harness engineering gets concrete. Simon Scrapes' video walks through a Claude Code setup that implements the entire harness as a folder tree. No framework, no orchestration code, just Markdown files the agent reads on every invocation.
The structure looks like this:
/project-root
├── claude.md # Root operating instructions
├── brand-context/ # Voice, ICP, positioning
│ ├── voice.md
│ ├── icp.md
│ └── positioning.md
├── agent-context/ # Persona and interaction rules
│ ├── agent.md
│ └── user.md
├── project-memory/ # Plans, decisions, history
│ ├── current-plan.md
│ └── decisions.md
├── skills/ # Reusable expertise modules
│ ├── research/
│ │ ├── skill.md
│ │ └── learnings.md
│ └── draft-post/
│ ├── skill.md
│ └── learnings.md
└── review/ # Human checkpoint folder
Each folder maps to one of PY's abstract harness components:
- claude.md is the system prompt
- brand-context/ is the context the agent loads when the task requires understanding who the business is for
- project-memory/ is file-backed state (instead of relying on the conversation window)
- skills/<name>/skill.md is a tool contract: inputs, outputs, process
- skills/<name>/learnings.md is a verification-loop artifact; it accumulates corrections over time
- review/ is the final checkpoint that solves premature completion
The sneaky part is the four-layer split of memory. The naive approach is to stuff everything into one big claude.md and hope the model attends to the right parts. This creates context rot — the model starts ignoring rules that aren't relevant to the current task, and the rules that are relevant get diluted. The split into four layers lets each skill pull in only the context it needs. A research skill might load brand-context/ but skip agent-context/; a reply-drafting skill might do the reverse.
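The per-skill loading described above could be sketched like this. The `SKILL_LAYERS` manifest and the `draft-reply` skill name are assumptions for illustration; the video's setup expresses the same idea in Markdown instructions rather than Python.

```python
from pathlib import Path

# Hypothetical manifest: which context layers each skill pulls in.
SKILL_LAYERS = {
    "research": ["brand-context"],
    "draft-reply": ["agent-context"],
}


def load_context(root: Path, skill: str) -> str:
    """Concatenate claude.md, the skill's declared layers, and its skill.md."""
    files = [root / "claude.md"]
    for layer in SKILL_LAYERS[skill]:
        # Sorted so the assembled prompt is stable from run to run.
        files.extend(sorted((root / layer).glob("*.md")))
    files.append(root / "skills" / skill / "skill.md")

    sections = []
    for path in files:
        label = path.relative_to(root).as_posix()
        sections.append(f"# {label}\n{path.read_text(encoding='utf-8').strip()}")
    return "\n\n".join(sections)
```

Layers a skill does not declare never enter the prompt, so they cannot dilute the rules that matter for that task.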
This is the same principle that Anthropic's context engineering work has formalized: context is a resource to be managed, not a bucket to be filled. Every token in the prompt is a token competing for the model's attention.
Self-improving skills and the meta-harness
The second interesting pattern from the Claude Code setup is skills that learn. Every skill has a learnings.md file alongside its definition. When the agent finishes a task, it explicitly asks the user "did I get this right?" — and if there's a correction, it writes the correction to learnings.md. Next time the skill runs, it reads the updated rules.
┌─────────────────────┐
│ Skill runs │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Asks for feedback │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Writes correction │
│ to learnings.md │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Next run reads │
│ updated rules │
└─────────────────────┘
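As a hedged sketch, the loop above needs only two small file operations. The function names and the "Corrections from earlier runs" heading are hypothetical; only the `skill.md` and `learnings.md` file names come from the setup described here.

```python
from pathlib import Path


def record_correction(skill_dir: Path, correction: str) -> None:
    """Append one user correction as a bullet in the skill's learnings.md."""
    text = " ".join(correction.split())  # collapse whitespace onto one line
    if not text:
        return  # the user had no correction, so there is nothing to learn
    with (skill_dir / "learnings.md").open("a", encoding="utf-8") as handle:
        handle.write(f"- {text}\n")


def build_skill_prompt(skill_dir: Path) -> str:
    """Skill definition plus every correction accumulated so far."""
    prompt = (skill_dir / "skill.md").read_text(encoding="utf-8")
    learnings = skill_dir / "learnings.md"
    if learnings.exists():
        prompt += "\n\nCorrections from earlier runs:\n"
        prompt += learnings.read_text(encoding="utf-8")
    return prompt
```

Because corrections are appended rather than rewritten, the file doubles as a history of what the skill got wrong.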
This is a manual, file-based version of the meta-harness idea from PY's video. The industrial version uses a proposer model (Anthropic's examples use Opus) that reads failed execution traces, diagnoses the failure, and writes a new harness. An evaluator agent then tests the new harness against held-out tasks to see if it actually improved things.
Either way — hand-rolled with a Markdown file or automated with a proposer/evaluator loop — the principle is the same: the harness is an optimization target, not a fixed configuration. You're doing ML research on your orchestration logic, not hand-tuning prompts.
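Treating the harness as an optimization target might look like the following sketch. It assumes a harness is a plain dict, a task is an `(input, expected_output)` pair, and `run` and `propose` are callables you supply; in the automated version, `propose` would be the proposer model reading failure traces. None of these names come from the videos.

```python
from typing import Callable, Sequence

Harness = dict  # hypothetical: any harness configuration
Task = tuple    # (input text, expected output)
Runner = Callable[[Harness, str], str]


def pass_rate(run: Runner, harness: Harness, tasks: Sequence[Task]) -> float:
    """Fraction of tasks whose output matches the expected output exactly."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for given, expected in tasks if run(harness, given) == expected)
    return passed / len(tasks)


def failures(run: Runner, harness: Harness, tasks: Sequence[Task]) -> list:
    """Failure traces as (input, expected, actual) for the proposer to read."""
    traces = []
    for given, expected in tasks:
        actual = run(harness, given)
        if actual != expected:
            traces.append((given, expected, actual))
    return traces


def optimize(run, propose, harness, dev_tasks, held_out, rounds=3):
    """Keep a proposed harness only if it scores higher on held-out tasks."""
    best, best_score = harness, pass_rate(run, harness, held_out)
    for _ in range(rounds):
        traces = failures(run, best, dev_tasks)
        if not traces:
            break  # nothing left to diagnose on the development tasks
        candidate = propose(best, traces)
        candidate_score = pass_rate(run, candidate, held_out)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Scoring on held-out tasks, not the tasks the proposer saw fail, is what keeps the loop from overfitting the harness to a handful of traces.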
Chained workflows with human checkpoints
The third pattern worth pulling out is how automated workflows should be structured. The naive approach to scheduled automation is to run a prompt on a timer. "Every morning at 8am, run this prompt." This fails badly, because a bare prompt on a schedule produces unsupervised output that nobody reviews until it goes live.
The better pattern is to chain skills together and terminate in a review folder:
flowchart LR
CRON[Scheduled trigger] --> S1[Skill: research]
S1 --> S2[Skill: outline]
S2 --> S3[Skill: draft]
S3 --> REVIEW["/review/ folder"]
REVIEW -->|human approves| PUBLISH[Publish]
REVIEW -->|human edits| PUBLISH
REVIEW -->|rejected| ARCHIVE[Archive]
The automation does 80% of the work — research, narrowing topics, drafting — and the final output lands in a review folder for human approval before it goes anywhere. This is the cheapest possible solution to premature completion: you don't need the agent to recognize when its output is good enough, because a human does that step manually.
The generalization is: any scheduled workflow must terminate in a human checkpoint, not a publish action. This is a harness-level constraint, not a prompt-level one. The prompt can say "only publish high-quality content" all it wants — the structural protection is that the publish action isn't available to the agent at all.
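One way to express that structural constraint, as a sketch with hypothetical names: the chain the scheduler runs can only write into the review folder, and the publish step is a separate function that a human invokes and that is never registered as an agent tool.

```python
from pathlib import Path
from typing import Callable

Skill = Callable[[str], str]  # assumption: each skill maps text to text


def run_chain(skills: list[Skill], seed: str, review_dir: Path, name: str) -> Path:
    """Run skills in order; the only side effect is a draft in review_dir."""
    artifact = seed
    for skill in skills:
        artifact = skill(artifact)
    review_dir.mkdir(parents=True, exist_ok=True)
    draft = review_dir / f"{name}.md"
    draft.write_text(artifact, encoding="utf-8")
    return draft


def approve(draft: Path, published_dir: Path) -> Path:
    """Human-invoked step: move an approved draft out of the review folder."""
    published_dir.mkdir(parents=True, exist_ok=True)
    return draft.rename(published_dir / draft.name)
```

Nothing in `run_chain` can reach `published_dir`, so no prompt wording can cause an unreviewed publish.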
Two worldviews, one problem
The IBM "7 Skills" video approaches the same territory from a very different angle. Where the Claude Code setup is a filesystem with four layers of context, IBM's framing is a list of seven engineering disciplines:
- System design — agents are orchestras of components, treat them like distributed systems
- Tool and contract design — every tool needs a strict input/output schema
- Retrieval engineering — chunking, embeddings, re-ranking for RAG
- Reliability engineering — retries with backoff, timeouts, circuit breakers, fallback paths
- Security and safety — prompt injection defenses, input validation, permission boundaries
- Evaluation and observability — tracing every decision, test pipelines, success metrics
- Product thinking — communicating confidence, graceful failure, knowing when to escalate
These two worldviews disagree in interesting ways. Here's where they land:
They agree on two things. Tool contracts must be strict — IBM says vague schemas let the agent "use its imagination, which is dangerous," and the filesystem harness enforces the same discipline through explicit skill.md definitions. They also agree that when an agent fails, your first move is to trace the failure, not to tweak the prompt. IBM formalizes this as observability engineering with tracing pipelines; the filesystem setup does it with review folders and explicit feedback capture.
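A strict tool contract can be small. The sketch below is an assumption-laden illustration, not IBM's or anyone's schema format: `ToolContract` is hypothetical, and it checks only argument names and exact types before a tool call goes through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    params: dict  # parameter name -> required Python type

    def validate(self, args: dict) -> dict:
        """Reject missing, unexpected, or wrongly typed arguments."""
        missing = sorted(set(self.params) - set(args))
        unexpected = sorted(set(args) - set(self.params))
        if missing or unexpected:
            raise ValueError(
                f"{self.name}: missing={missing} unexpected={unexpected}"
            )
        for key, expected in self.params.items():
            # Exact type match, so True is not silently accepted as an int.
            if type(args[key]) is not expected:
                raise TypeError(
                    f"{self.name}.{key}: expected {expected.__name__}, "
                    f"got {type(args[key]).__name__}"
                )
        return args
```

Rejecting unexpected arguments matters as much as requiring the expected ones: it is the part that stops the agent from inventing parameters.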
They disagree sharply on RAG. IBM treats retrieval engineering as a core discipline. The filesystem-based harness doesn't discuss RAG at all — because when your model has a 200k-token context window and your knowledge base is a few folders of Markdown, you just load the files. No vector DB, no chunking, no re-ranking. For knowledge bases that fit in a context window, the simplest possible retrieval (read the whole folder) beats an engineered RAG pipeline, because RAG introduces retrieval errors that context-stuffing avoids. For knowledge bases spanning millions of documents, IBM is right and you need proper retrieval.
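The decision rule in that paragraph can be written down as a rough sketch. The 200k window comes from the text above; the four-characters-per-token estimate and the reserved headroom are assumptions you would replace with a real tokenizer count and your own budget.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4


def choose_retrieval(
    documents: list[str],
    context_window: int = 200_000,
    reserve: int = 20_000,
) -> str:
    """Return "load_all" if the whole knowledge base fits, else "retrieve".

    reserve is headroom kept free for instructions, the task, and the output.
    """
    budget = context_window - reserve
    total = sum(estimate_tokens(doc) for doc in documents)
    return "load_all" if total <= budget else "retrieve"
```

A few folders of Markdown land on the `load_all` side; a corpus of millions of documents does not, which is where IBM's retrieval discipline takes over.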
Reliability engineering is the biggest gap in the filesystem approach. IBM dedicates a whole discipline to retries, circuit breakers, and fallback paths. The solo-operator harness barely mentions them. This isn't because reliability doesn't matter — it's because a solo automation has a very different failure-cost profile than an agent serving paying customers. If your personal skill errors out, you re-run it. If IBM's agent errors out on a refund workflow, you have an angry customer and a support ticket.
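For readers who do need that discipline, here is a minimal sketch of two of the pieces IBM names: retries with exponential backoff and a circuit breaker with a fallback path. The parameter defaults are arbitrary assumptions, and this breaker is deliberately simple: it stays open until `reset()` is called and does not implement a timed half-open state.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing call, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


class CircuitBreaker:
    """Stop calling a dependency after repeated consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.failures >= self.threshold:
            return fallback()  # circuit open: skip the dependency entirely
        try:
            result = fn()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # a success closes the count again
        return result

    def reset(self) -> None:
        self.failures = 0
```

Passing `sleep` as a parameter is a small design choice that makes the backoff schedule testable without actually waiting.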
Security is the cleanest divergence. IBM devotes a whole skill to prompt injection, input validation, output filters, and permission boundaries. The filesystem harness runs on your own machine, reads your own files, and acts on your own behalf — there's no adversarial user to defend against. The threat model is fundamentally different. But the moment that harness is shared with a team or exposed to outside input (an email parser, a public webhook, a shared Slack connector), IBM's security chapter suddenly applies and the harness needs to grow.
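When the harness does grow in that direction, permission boundaries are a reasonable first layer. The sketch below is illustrative only: the tool names in `ALLOWED_TOOLS` are hypothetical, and it covers just two checks, a tool allowlist and confinement of file access to the project root. It does not detect prompt injection; it limits what an injected instruction can reach.

```python
from pathlib import Path

# Hypothetical allowlist: the only tools this harness exposes to the agent.
ALLOWED_TOOLS = frozenset({"read_file", "write_review_draft"})


def authorize(tool: str) -> None:
    """Refuse any tool call that is not explicitly allowlisted."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not permitted: {tool}")


def resolve_inside(root: Path, requested: str) -> Path:
    """Resolve a requested path and refuse anything outside the root.

    Catches "../" traversal and absolute paths, because the check runs
    on the fully resolved target instead of the raw string.
    """
    root = root.resolve()
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"path escapes project root: {requested}")
    return target
```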
flowchart TD
START[Are you the only user?] -->|Yes| SOLO[Solo-operator harness]
START -->|No| TEAM[Team/production harness]
SOLO --> F1[Filesystem + folders]
SOLO --> F2[Markdown-based skills]
SOLO --> F3[Review folders]
TEAM --> P1[All of solo, plus...]
P1 --> P2[Reliability: retries, circuit breakers]
P1 --> P3[Security: input validation, permissions]
P1 --> P4[Observability: tracing, metrics]
P1 --> P5[Potentially RAG if KB is large]
The synthesis
The synthesis across all three videos: harness engineering is a single idea wearing different costumes for different contexts. PY gives you the theory — Agent = Model + Harness, with the harness as the optimization target. Simon gives you a concrete filesystem implementation that works for a single user. IBM gives you the production checklist that matters when you have to ship.
If you're building for yourself, start with the filesystem structure. Four memory layers, skills with learning files, scheduled workflows that terminate in a review folder. This gets you most of the way there and costs nothing but some mkdir commands.
The moment another human depends on the output, or the moment adversarial input can reach the system, graft IBM's skills 4, 5, and 6 (reliability engineering, security and safety, and evaluation and observability) on top. Add retry logic and timeouts around external calls. Add input validation and permission boundaries. Add tracing so you can figure out what broke when something breaks. The harness grows to match the threat surface.
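Tracing is the cheapest of those to start with. As a minimal sketch, assuming an in-memory list is enough for a first version (the `traced` decorator and `TRACE` list are hypothetical names, and a production setup would ship these records to a tracing backend instead):

```python
import functools
import time
from typing import Any, Callable

# In-memory trace log; each entry describes one step of one run.
TRACE: list = []


def traced(step: str) -> Callable:
    """Decorator that records outcome and duration for every call of a step."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record = {"step": step, "ok": False, "error": None}
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                record["ok"] = True
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise  # tracing observes failures; it never swallows them
            finally:
                record["seconds"] = time.perf_counter() - start
                TRACE.append(record)

        return wrapper

    return decorator
```

With each skill wrapped this way, "what broke" becomes a lookup in the trace, not a guess followed by a prompt tweak.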
And this is the part that most people get wrong: don't build the harness all at once. Anthropic's engineers explicitly describe harness work as iterative: build the simplest thing that works, find the specific failure mode that matters, add the minimum scaffolding to solve it, repeat. Every component you add is an assumption about what the model can't do, and those assumptions need to be questioned regularly because models are getting better. The harness you need for Claude 4.5 may be too heavy for Claude 5; pieces that were load-bearing last year may be dead weight now.
The most interesting shift isn't that harness engineering has replaced prompt engineering. It's that the most valuable artifact in AI engineering is no longer the prompt — it's the harness. Prompts are throwaway, tied to specific model versions and specific tasks. A good harness is portable across models, testable against held-out tasks, and accumulates improvements over time. It's the part of the system that has durable value.
If you're spending more time tweaking prompts than designing the structure around them, you're probably optimizing the wrong thing.
Citations
- Rethinking AI Agents: The Rise of Harness Engineering (PY): https://youtube.com/watch?v=Xxuxg8PcBvc
- The Claude Code Setup Nobody Shows You (Simon Scrapes): https://youtube.com/watch?v=c2kJ7j3CgUs
- The 7 Skills You Need to Build AI Agents (IBM Technology): https://youtube.com/watch?v=mtiOK2QG9Q0
- Harness design for long-running application development (Anthropic Engineering): https://www.anthropic.com/engineering/harness-design-long-running-apps
- Effective harnesses for long-running agents (Anthropic Engineering): https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Scaling Managed Agents: Decoupling the brain from the harness (Anthropic Engineering): https://www.anthropic.com/engineering/managed-agents
- Skill Issue: Harness Engineering for Coding Agents (HumanLayer): https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents