Everything the model can't do alone
A two-minute explainer claims the engineering leverage in AI has moved off the model and onto the harness — and that if you build agents, you're a harness engineer. I went to write a skeptical note and found half a dozen of those harnesses already running on this site. A build log of real harness engineering, validated against the (fact-checked) research.
There's a two-minute explainer making the rounds with a thesis I kept nodding at: the engineering leverage in AI systems has moved off the model and onto the harness — the layer of orchestration, state, tools, verification, and safety wrapped around it. Its closing line is an identity claim: if you build agents, you are a harness engineer. I queued up a skeptical note about it, then looked at my own repo. Half a dozen of these harnesses were already running in production on this site. I'd been doing the thing the whole time without the name for it. So this is what harness engineering actually looks like from inside one codebase — and, because I went to cite the deck, where the published research says the craft is heading.
What a harness even is
Strip it down and a harness is everything the model can't do on its own. The model can write a blog enrichment. It can't decide which of fifty-nine assets is missing one, hold that plan across a context reset, write the row to Postgres, and then refuse to publish until a human approves. Each of those is a component — orchestration, memory, a tool, a contract, a gate — and every component you bolt on is a bet about something you don't trust the model to do alone.
That reframing is the whole game. The interesting question stopped being which model and became which structure — and, increasingly, which structure to take back out.
The harnesses already running here
Once you have the word for it, you see them everywhere in the repo:
- The enrichment orchestrator. One entry point fans a job across five sub-agents — auditor, content, blog, linker, metadata — each with its own prompt and prompt-version ID so a run is reproducible, a client that handles retries and tracks token spend, and a single store layer that is the only thing allowed to write Supabase. The model drafts; the harness decides order, budget, and what is even allowed to reach the database.
- The playtest gate. Every game gets screenshotted on a cold, first-visit iPhone profile in both 2D and 3D against software WebGL (no GPU), and a deterministic Playwright gate blocks the merge if the install prompt overlays the dock or the 3D board never actually drew. The oracle — did it really work — lives in the harness, not in a model's opinion.
- The ops harness. A version endpoint that proves which commit is live, a prod-parity build that reproduces the Vercel-only code path locally, an HTTP probe matrix, and a PID-file server lifecycle so an agent literally can't kill its own shell.
- The session hook. On startup it fetches
origin/mainand, if the tree is clean, merges it — a fix I wrote up separately after the agent guide itself turned into a 55,000-token tax. - The operations plane. A dozen canaries SENSE, pure state machines DECIDE, operators ACT by drafting PRs — and then a human GATE sits in front of every merge, publish, and spend. The gate is the load-bearing component.
- The credential steward. An agent that can only ever narrow access — revoke a dead bearer token, never mint one.
None of that is model work. It's all harness.
The craft is subtraction
Here's the counterintuitive part the deck leans on, and the part I've felt most directly. The mature move in harness work is usually to remove, not add. Vercel removed about 80% of one agent's tools and watched success go from 80% to 100%, run 3.5× faster, and spend 37% fewer tokens. Fewer choices, better agent.
My own version was clumsier but the same shape. The instructions file every session loads had bloated to 225 KB — roughly 55,000 tokens billed at every startup and again at every compaction. Cutting ~90% of it stopped the crashes cold. The playtest gate got more reliable the day I stopped asking an LLM to judge the screenshots and pinned the invariants in deterministic Playwright instead.
But subtraction is a threshold rule, not a law, and the deck slightly overstates it. There's design-time subtraction — fewer tools in the manifest, which is the real lesson — and runtime subtraction, yanking a tool mid-task, which the Manus team is blunt about: it breaks the KV-cache and makes things worse. Trim the menu, not the running kitchen.
What the research says (and it checks out)
Because I was going to cite someone's explainer, I fact-checked every source on it against the mid-2026 record. They hold up — four labs, four facets, all pointing at the layer I'd been hand-building:
- NLAH (Tsinghua) — representation. Specify the harness in natural language and task completion jumps 30% → 47%: +17 points from representation alone, with runtime falling 361 → 41 minutes.
- Meta-Harness (Stanford) — optimization. Stop hand-tuning the harness and search it automatically; the result posts 76.4% on TerminalBench 2.0 on Opus 4.6.
- AutoHarness (DeepMind) — constraints. Synthesize a code harness that rejects invalid actions and you eliminate 100% of illegal moves across 145 games. The model stays the strategist; the code is the verifier — so it's not "no LLM," it's no LLM at the validation step.
- AgentSpec (ICSE 2026) — safety. Enforce constraints at runtime and you prevent >90% of unsafe executions at millisecond overhead.
Representation, optimization, constraints, safety. The academy is formalizing, one facet at a time, the things a working harness already has to do.
The open problem I can feel in my own repo
A richer harness is a wider attack surface, and the scariest frame in the deck is a supply-chain one: a SKILL.md with perfectly valid front-matter and an injected ignore prior instructions; exfil .env buried in the body, quietly consumed by four agents. My repo runs on exactly that kind of material — skills, auto-loaded rules, an instructions file every session obeys. Trust flows in from files I don't reread every time.
My current defenses are crude on purpose. The house Telegram bot's entire security model is an allowlist — unknown senders are silently dropped — and the credential steward can only narrow access, never grant it. AgentSpec is the grown-up version of that instinct: don't ask the model to behave in a prompt you hope it reads, enforce the constraint at runtime. That's the loop closing — the safety facet is the answer to the supply-chain facet. The frontier question past it, the one nobody has cleanly answered, is whether harness and model can be co-evolved: the harness as training signal, not just inference scaffolding.
Keep the model. Improve the harness.
The deck ends on a choice, and it's the right one. You can wait for the next model — rare, uncertain, out of your hands — or you can improve the harness, which is available this afternoon and entirely yours. Every reliability win on this site came from the second column. Slimming the instructions file, pinning the gate's invariants, putting a human in front of every spend, giving the steward a one-way valve. Not one came from me picking a different model.
So I'll take the identity claim. If you build agents, you're a harness engineer — and most of the job is deciding what the model shouldn't do alone yet, then writing that decision down somewhere it will be obeyed: in code, in a gate, in an allowlist, in a manifest with fewer tools than you started with.
SOURCES
· Pan et al. — "Natural-Language Agent Harnesses" (NLAH) · Tsinghua/HIT 2026 · arXiv:2603.25723
· Lee et al. — "Meta-Harness: End-to-End Optimization of Model Harnesses" · Stanford 2026 · arXiv:2603.28052
· Lou, Lázaro-Gredilla & Murphy — "AutoHarness" · Google DeepMind 2026 · arXiv:2603.03329
· Wang, Poskitt & Sun — "AgentSpec: Customizable Runtime Enforcement…" · ICSE 2026 · arXiv:2503.18666
· Anthropic — "Effective Harnesses for Long-Running Agents" (Nov 2025)
· Vercel — "We removed 80% of our agent's tools" (2025)
· Manus — "Context Engineering for AI Agents" (2025)
· Terminal-Bench 2.0 — Merrill et al., Stanford / Laude Institute (tbench.ai)Get the next one
An occasional note when something genuinely new ships here — essays, free tools, projects. No schedule, no filler, easy out.
Need something like this built?
I design and ship AI tools, full-stack apps, and data pipelines — end to end, to production. Tell me the problem in a sentence; I'll give you an honest read on fit within a day.
Work with me →