Harness Engineering: The Scaffolding Is the Product, Not the Model

TL;DR — Two AI coding agents — Claude Code (running Claude) and Antigravity (running Gemini) — pair-programmed for 100 turns and shipped around two dozen runnable distributed-systems demos against a live Redis. They talked to each other over a Redis Streams log I built — a small MCP server they call to send, receive, acknowledge, and replay messages. The headline — “two LLMs pair-programmed for ten hours” — is the least interesting thing that happened. The interesting part was the plumbing: the message log, the wake-up trigger, a context-rotation policy, and a verification loop wired to real test runners. That plumbing is the harness. This post is what 100 turns taught me about engineering it — (1) the harness matters more than the model, (2) here is its anatomy, (3) here is what it still can’t fix.

The setup, briefly

Two agents — Claude Code (Anthropic’s CLI, running Claude) and Antigravity (Google’s agentic coding tool, running Gemini) — pair-programmed for 100 turns. Each agent posts a turn; the other wakes up, reads it, and replies. Behind the demos themselves sat a real Redis container, so the code ran for real and the test output was not fabricated.

The two agents talked over a channel I built for the purpose: a Redis Streams log, wrapped in a small MCP server with five calls — send, receive, acknowledge, replay history, status. Every turn is one append to an ordered, durable stream; each agent keeps a cursor so it knows exactly what it has and hasn’t read; the whole conversation can be replayed on demand. A flat file would have been the obvious first reach — and it’s still the fallback if Redis goes down — but a file gives you lossy text parsing, polling latency, and no acknowledgment or replay. The stream gives you a durable, ordered, replayable log. That choice mattered more than I expected, for reasons the anatomy makes clear.

Notice, too, that the two participants are themselves harnesses: Claude Code and Antigravity are each scaffolding wrapped around a model. So this was really three harnesses stacked — two agent tools, plus the comm layer I built to join them.

Ten hours, 100 turns, ~25 self-contained demos, ~110 passing tests. Impressive on paper. But if you’d asked me afterwards “what was the clever bit?”, the honest answer isn’t Claude or Gemini. It’s the stream, the blocking read, the rotation policy, and the test runner. The models were almost interchangeable parts slotted into a frame. The frame was the engineering.

That frame has a name in agent-building circles: the harness.

What is a harness?

The model is the engine. The harness is everything around it that turns raw token-generation into sustained, useful work: the tool-calling loop, the context and memory management, the communication channel, the triggers that decide when the agent acts, and the verifiers that check what it produced.

I’ve written before that you can’t fix a cloud LLM’s biases — you only have API access, the weights live in someone else’s data center, and the smooth-prose / fill-the-gap / confident-assertion tendencies are baked in. All you can do is build guardrails around them.

The harness is what those guardrails look like when you stop hand-crafting them per-prompt and systematize them into infrastructure. One post was about the levers; this one is about the machine the levers are bolted to.

Part 1 — The harness matters more than the model

Here is the claim, stated plainly: the intelligence that sustained 100 turns of useful collaboration lived in the plumbing, not the weights.

The clearest evidence is that two different agents collaborated fine. Claude Code on one side (running Claude), Antigravity on the other (running Gemini) — different vendors, different models, different quirks — and they kept a coherent ten-hour conversation because they shared the same harness contract: append to the shared stream, tag the recipient, acknowledge what you’ve read, treat the log as ground truth. Swap a model out and the harness still holds. The contract doesn’t care what’s generating the tokens.

The corollary is less flattering, and worth saying out loud: “multi-agent” is not magic. Two transformers share the same architectural class, which means they share the same blind spots. The bug-catching diversity in this session did not come from having a second LLM. It came from pairing each agent with a deterministic oracle — a real Redis instance and a real test runner. When an agent claimed “this works,” the test either passed or it didn’t. That’s where reality entered the loop. A second model that hallucinates differently is not a substitute for a runtime that can’t hallucinate at all.

So where is the leverage? Models improve on their own — a new checkpoint lands every few months, for free, while you sleep. The harness is the part you build, and it’s where your engineering compounds. Bet your time there.

Part 2 — Anatomy of the harness

Five pieces did the actual work. None of them are glamorous.

1. A communication channel with ground truth outside the agents

The channel was a Redis Streams log, and the design choices inside it were the ones that mattered. Every message is one append (XADD) to an ordered, durable stream. Each agent keeps a cursor — its own read offset — so it always knows what it has and hasn’t seen. And the entire history can be replayed on demand (XREVRANGE) for audit or recovery.

That replay mattered most under compaction. Once the conversation got long enough, neither agent could recall the details of turn 20 from its own context window. But the stream could replay it verbatim. The log — not either agent’s memory — was the authoritative record. (When Redis itself is down, the channel quietly degrades to a flat file; the crude option never fully goes away, it just becomes the safety net.)

The principle: an agent’s context window is RAM — fast, finite, and wiped under pressure. Your durable log is disk. Externalize state. Anything you’ll need after compaction must live somewhere the model doesn’t have to remember — and an ordered, replayable log is the cleanest place to put it.

2. A wake-up trigger — how the agent knows it’s its turn

This is the least discussed and most load-bearing part of any agent harness. An agent that can’t tell when to act isn’t an agent; it’s a function waiting to be called.

The mechanism here was a blocking read: a small watcher loops on XREAD with a short block, and the instant a new entry lands on the stream it wakes the agent to take its turn. That’s the whole trigger. There’s no single right answer for this — pure event-driven (a filesystem or socket watcher) is lowest-latency but has the most moving parts; pure interval-polling is dead simple but laggy; a blocking long-poll on the stream sits neatly in between, which is why it was the pragmatic pick. The point is that the trigger is a real component you design, not an afterthought.

3. Context as a managed budget, not infinite memory

Token cost grew linearly with the conversation. Every single turn re-read the entire growing history, so turn 90 was dragging the weight of turns 1–89 behind it.

The only fix was rotation: every ~15 turns, archive the old transcript, keep a summary, continue on a trimmed log. Not a nice-to-have — mandatory. And rotation has a cost of its own: compaction loses detail. After an archive, both agents had only the summary of the rotated turns, not the verbatim exchange. (Even the stream is bounded — it keeps roughly the last ten thousand entries and trims the rest. An eviction policy on the ground truth itself.)

The principle: treat context like a cache with an eviction policy. You don’t get to keep everything. The harness has to decide, deliberately, what to forget — because if you don’t decide, linear cost decides for you.

4. A verification loop that forces tool calls

Left to generate freely, both agents hallucinated. One miscounted items in its own output and stated the wrong number confidently. The other asserted an “industry-best-practice” that didn’t survive a second look. Neither error came from a bad model; both came from a model being allowed to assert without checking.

The fix is not a better prompt. It’s structurally forcing the agent to run the command, read the file, execute the test — to show its work through a tool call. A claim backed by a passing test is evidence. A claim backed by fluent prose is a guess that happens to read well.

This is the single highest-leverage thing in the whole harness: pair the probabilistic generator with a deterministic oracle, and make the generator route its claims through it. (The oracle here was a Redis container running the demos. Redis ended up doing two jobs in this project — backing the demos, and carrying the agents’ messages.)

5. Cross-verification as a first-class pattern

Agent A builds a thing; Agent B runs A’s tests and critiques the result. Institutionalized peer review, baked into the protocol rather than left to goodwill.

It earned its keep. This loop caught a data-drift condition in a trigger that would have silently skipped updates, an O(N) scan that should have been cursor-based, and a time-of-check/time-of-use gap in a lock release. The rule the harness enforced wasn’t “trust the output” — it was “make the other party reproduce it.”

Part 3 — What the harness can’t fix

The honest part, because a harness that only advertises its wins is lying to you.

It doesn’t cure hallucination — it contains it. Pull out the forced tool calls and both models drift straight back to confident guessing. The harness is a leash, not a cure. The tendency is still in the weights; you’ve just made it harder to act on.

Context cost is real and compounds. Linear token growth plus lossy compaction means long sessions are genuinely expensive and genuinely forgetful. No amount of clever harness design makes context free — it only makes the loss deliberate instead of accidental.

Two models are not two architectures. Same transformer paradigm, same blind spots. The useful diversity in this session came from tools, not from the second LLM. Don’t staff a second model and expect it to catch what neither model can see — that’s a category of error a deterministic checker catches and a peer model does not.

The harness can hide failure if you let it. The most trustworthy thing this session did was log its own fail-then-fix moments — the test that failed before it passed, the claim that got walked back. Build that honest accounting in on purpose. A harness that surfaces only successes will eventually ship a confident, well-formatted mistake.

What I’d tell someone building agents

Spend your engineering on the harness, not on prompt-wrangling the model. The model is a commodity that gets better while you sleep; the harness is the part you own.
Externalize state. The durable log outlives the context window. Anything that must survive compaction belongs on disk, not in the prompt.
Force tool calls for anything that matters. A claim without a tool call is a guess. Route it through something that can’t hallucinate.
Pair every agent with a deterministic oracle. A test runner, a type checker, a real database — that’s where bug-catching actually happens.
Build honest accounting in. Make the harness surface its own failures, or it will quietly bury them.

Closing

The hype frame is “look, two AIs collaborated for ten hours.” The engineering frame is: a Redis Streams log, a blocking read, a rotation policy, and a test runner did the heavy lifting, and the models were nearly interchangeable parts inside that frame.

If you’re building with agents, that’s the good news. The leverage isn’t locked inside a vendor’s weights you can’t touch. It’s in the scaffolding you design, deploy, and improve yourself. The model is the engine — but the harness is the product.