I let a rival AI review my site, then shipped every fix in a day
Essays · July 4, 2026 · 5 min read

I gave this site's own documentation to a rival AI and told it to find what's wrong. One day later: ten verified issues, nine pull requests, and a set of permanent drift guards, with a human behind every merge. Here's the whole process, with diagrams, and how to use it on your own project.

On the Fourth of July I handed this site's own documentation to a rival AI and asked it to tear the place apart. Twenty-four hours later, every finding that survived verification had been filed, built, tested, and merged: ten issues, nine pull requests, and a new set of permanent guards that keep the site honest after everyone stops looking. This post walks through the whole process with diagrams: how an outside review becomes shipped code, why the loop is safe to run with AI agents doing the heavy lifting, what you could use it for, and what I'll be using it for next.

Start with an outsider who owes you nothing

This site maintains a capabilities series: eight chapters of documentation describing what the system can do, how it's secured, and where the known gaps are. Documentation written by insiders inherits insider blind spots, so I gave the entire series to a different AI model with one instruction: review this like a skeptical engineer who has never seen the codebase.

It came back with findings graded by severity: real gaps (no backup story, no spend meter), risks (one automation path that can merge code without a human), and a stack of recommendations. Some were sharp. Some were confidently wrong, which is why the next step matters more than the review itself.

Verify before you believe

Every claim in the review was checked against the actual code before anything got filed. A review is an opinion until the tree confirms it, and roughly a third of the findings needed correcting: features the reviewer thought were missing that already existed, risks that were already fenced off by tests it hadn't read.

The corrected review, annotated with what held up and what didn't, was committed to the repo as a permanent record. Then each surviving finding became a GitHub issue, all of them hanging off one tracker issue that doubles as the sprint's checklist. From that point the work is boring on purpose: pick an issue, build it, ship it, tick the box.
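
The verify-then-file step can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual tooling: the `Finding` class, the verdict names, and the sample findings are all hypothetical stand-ins. Only the shape (check each claim, keep the survivors, hang them off one checklist) comes from the process described above.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"   # the tree agrees with the reviewer
    CORRECTED = "corrected"   # partly right; claim rewritten after checking
    REJECTED = "rejected"     # the tree contradicts the reviewer


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str             # e.g. "gap", "risk", "recommendation"
    verdict: Verdict
    note: str = ""            # what the code check actually showed


def surviving(findings: list[Finding]) -> list[Finding]:
    """Keep only findings that still stand after checking the code."""
    return [f for f in findings if f.verdict is not Verdict.REJECTED]


def tracker_checklist(findings: list[Finding]) -> str:
    """Render each surviving finding as one unchecked checklist line."""
    return "\n".join(f"- [ ] ({f.severity}) {f.title}" for f in surviving(findings))


# Hypothetical sample data, loosely modeled on the kinds of findings above.
review = [
    Finding("No backup story", "gap", Verdict.CONFIRMED),
    Finding("No spend meter", "gap", Verdict.CONFIRMED),
    Finding("Feature X is missing", "gap", Verdict.REJECTED, "already exists in-tree"),
]

print(tracker_checklist(review))
```

The rejected finding never reaches the tracker, but it stays in the `review` list with its note, which mirrors keeping the annotated review in the repo as a record.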

[Diagram: the loop. review (outside AI reads) → verify (check the tree) → issues (tracker + specs) → build (one issue, one PR) → gates (tests + guards) → merge (a human decides); merged code is the next review's input. Legend: system step, human gate. One lap of the highlight = one shipped fix.]
One lap of the loop: an outside review is verified against the real code, becomes tracked issues, each issue becomes a single pull request, the gates run, and a human merges. The merged code is what the next review reads, so every lap starts from a stronger baseline.

One issue, one pull request, one human click

Build discipline carried the day: each pull request holds exactly one concern, opened only after the previous one merged, so review stays possible and any single change can be reverted without untangling the rest. Every PR runs the full test suite, the linter, a production build, and a mobile playtest gate before a human ever looks at it. And a human merges every single one.
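
The gate ordering above might be wired up roughly like this. It is a sketch under assumptions: the `Gate` type and the lambda checks are hypothetical placeholders for whatever commands actually run the tests, the linter, the production build, and the mobile playtest. The point is only that gates run in a fixed order, stop at the first failure, and never merge anything themselves.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[], bool]   # returns True when the gate passes


def first_failure(gates: list[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first one that fails."""
    for gate in gates:
        if not gate.check():
            return gate.name
    return None


def ready_for_human(gates: list[Gate]) -> bool:
    """Green gates only make a PR eligible; a person still clicks merge."""
    return first_failure(gates) is None


# Placeholder checks; a real setup would shell out to the project's own tools.
gates = [
    Gate("tests", lambda: True),
    Gate("lint", lambda: True),
    Gate("production build", lambda: False),  # simulate a broken build
    Gate("mobile playtest", lambda: True),
]

print(first_failure(gates))      # production build
print(ready_for_human(gates))    # False
```

Note that nothing in the sketch performs a merge. Passing every gate only changes the answer to "is this worth a human's attention yet?"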

What actually shipped in those nine PRs: an AI spend ledger with a $250 monthly budget brake that pauses agents at the limit; encrypted weekly backups that prove themselves by restoring into a scratch database (a backup that can't restore is not a backup); a data retention sweep that deletes third parties' information on a schedule; a security chapter consolidating the threat model; and a rate-limit audit that classified all 463 write endpoints on the site, with zero left unaccounted for.
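
The budget brake is the easiest of these to picture in code. Here is a minimal sketch, assuming spend is recorded per agent call in integer cents. The `SpendLedger` class and its method names are invented for illustration; only the $250 monthly limit and the pause-at-the-limit behavior come from the description above.

```python
from collections import defaultdict
from datetime import date

MONTHLY_LIMIT_CENTS = 250 * 100  # the $250 budget, kept in integer cents


class BudgetExceeded(RuntimeError):
    """Raised when an agent run is attempted with the month's budget spent."""


class SpendLedger:
    def __init__(self, limit_cents: int = MONTHLY_LIMIT_CENTS) -> None:
        self.limit_cents = limit_cents
        self._by_month: dict[tuple[int, int], int] = defaultdict(int)

    def record(self, day: date, cents: int) -> None:
        """Add one charge to the month it occurred in."""
        self._by_month[(day.year, day.month)] += cents

    def spent(self, day: date) -> int:
        """Total recorded spend for the month containing `day`."""
        return self._by_month[(day.year, day.month)]

    def paused(self, day: date) -> bool:
        """The brake engages once the month's spend reaches 100% of the limit."""
        return self.spent(day) >= self.limit_cents

    def check(self, day: date) -> None:
        """Call before each agent run; refuses to proceed when paused."""
        if self.paused(day):
            raise BudgetExceeded(f"monthly budget reached for {day:%Y-%m}")


ledger = SpendLedger()
ledger.record(date(2026, 7, 4), 24_000)
print(ledger.paused(date(2026, 7, 4)))   # False: $240 of $250
ledger.record(date(2026, 7, 5), 1_000)
print(ledger.paused(date(2026, 7, 5)))   # True: the limit is reached
print(ledger.paused(date(2026, 8, 1)))   # False: a new month starts at zero
```

Keeping amounts in integer cents avoids floating-point rounding at the boundary, which matters when the whole job of the code is to trip at exactly 100%.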

The part that outlives the sprint

The real payoff isn't the fixes. It's that most findings were converted into drift guards: small scanners paired with committed ledgers that fail the test suite the moment code and documentation disagree again. The docs' factual claims are now pinned to markers the test suite re-checks. Every credential has to be classified in a ledger or the build fails. Every write endpoint has to be rate-limited, auth-gated, signature-verified, or carry a written justification, or the build fails.
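
A drift guard of this kind can be quite small. The sketch below assumes a scanner has already produced the set of write routes and that the ledger is a committed mapping from route to protection. The route names, the `ALLOWED` labels, and the function name are hypothetical, chosen to mirror the four outcomes listed above.

```python
ALLOWED = {"rate-limited", "auth-gated", "signature-verified", "justified"}


def audit_write_routes(
    discovered: set[str], ledger: dict[str, dict[str, str]]
) -> list[str]:
    """Return readable problems; an empty list means code and ledger agree."""
    known = set(ledger)
    problems = []
    for route in sorted(discovered - known):
        problems.append(f"unclassified route: {route}")
    for route in sorted(known - discovered):
        problems.append(f"stale ledger entry: {route}")
    for route in sorted(discovered & known):
        entry = ledger[route]
        kind = entry.get("protection")
        if kind not in ALLOWED:
            problems.append(f"unknown protection for {route}: {kind!r}")
        elif kind == "justified" and not entry.get("reason", "").strip():
            problems.append(f"missing written justification: {route}")
    return problems


# Hypothetical scanner output and committed ledger.
discovered = {"POST /api/comments", "POST /api/webhook", "DELETE /api/session"}
ledger = {
    "POST /api/comments": {"protection": "rate-limited"},
    "POST /api/webhook": {"protection": "signature-verified"},
}

for problem in audit_write_routes(discovered, ledger):
    print(problem)   # unclassified route: DELETE /api/session
```

In a test suite the guard reduces to one assertion that the returned list is empty, so adding a new write route without touching the ledger fails the build. The stale-entry check runs the other way and catches ledger rows for routes that no longer exist.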

The principle: the fix is not the pull request. The fix is the test that fails if the problem ever comes back.
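
Pinning a documentation claim to the code could look something like the following. The marker syntax (`<!-- fact:name=value -->`) and the function names are assumptions made for this sketch, not the site's actual format. The idea is just that each number in the docs carries a machine-readable marker, and a test compares it with a value measured from the tree.

```python
import re

# Hypothetical marker format embedded next to a claim in the docs.
FACT_MARKER = re.compile(r"<!--\s*fact:(?P<name>[\w-]+)=(?P<value>[^\s>]+)\s*-->")


def documented_facts(doc_text: str) -> dict[str, str]:
    """Collect every pinned fact found in a documentation file."""
    return {m["name"]: m["value"] for m in FACT_MARKER.finditer(doc_text)}


def drifted(doc_text: str, measured: dict[str, str]) -> list[str]:
    """List facts whose documented value no longer matches the measured one."""
    problems = []
    for name, claimed in sorted(documented_facts(doc_text).items()):
        actual = measured.get(name)
        if actual is None:
            problems.append(f"{name}: documented but never measured")
        elif actual != claimed:
            problems.append(f"{name}: docs say {claimed}, tree says {actual}")
    return problems


doc = "The audit covers 463 write endpoints. <!-- fact:write-endpoints=463 -->"

print(drifted(doc, {"write-endpoints": "463"}))  # []
print(drifted(doc, {"write-endpoints": "464"}))
```

The second call reports the mismatch, which is the moment the test suite would go red: either the docs get updated or the change gets reconsidered, but the two cannot silently disagree.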

[Diagram: a change passes through four layers on its way to production, and fails loudly at whichever one rejects it.
  • drift guards: ledgers that fail the test suite when code and docs disagree (secrets audit · SQL audit · docs facts · rate-limit map)
  • tests + CI: every pull request runs the whole suite before anyone looks at it (6,800+ tests · lint · prod build · mobile playtest)
  • hard limits: boundaries that hold even if everything above misjudged (protected file paths · monthly AI budget brake)
  • a human approves: nothing publishes itself - merge and publish are hand clicks (the merge button · draft gates on outbound content)
Legend: system layer, human gate; dashed exits = where a bad change gets caught.]
Why it's safe to let agents do the heavy lifting: every change falls through the same four layers. Drift guards catch dishonesty, CI catches breakage, hard limits hold even when judgment fails, and a human makes the final call. A bad change doesn't fail quietly; it bounces out at whichever layer caught it.

  • 1 outside review: a rival AI read the docs cold
  • 10 issues filed: every finding verified in-tree first
  • 9 pull requests: one concern each, human-merged
  • 6,800+ tests green: the suite runs on every single PR
  • 463 write routes audited: zero left unclassified
  • $250 monthly AI budget: a brake pauses spend at 100%

The sprint by the numbers. One outside review became ten verified issues and nine single-concern pull requests, with the full suite green on every one.

What a loop like this is for

The pattern generalizes well beyond one person's website:

  • Teams whose docs drift from their code. The review finds the drift once; the guards keep finding it forever.
  • Solo builders shipping with AI agents. You get agent speed without giving up control, because the gates are structural, not aspirational.
  • Audits that normally rot in a wiki. Security posture, rate limiting, data retention, backup readiness: each became a living audit here, re-run on every commit.
  • Any codebase inheriting a new owner. An outside review plus in-tree verification is a fast, honest map of what you actually bought.

It's also model-agnostic. Any strong reviewer works, human or AI; the value is in the verify step and the gates, not in which brain wrote the critique.

What I'll be using it for

Here, the loop becomes a standing quarterly review: the capabilities series gets handed to an outside model, the findings get verified and filed, and the drift guards accumulated from each lap make the next one shorter. The budget brake now governs my agent spend every day, and the retention sweeps run weekly whether I remember them or not.

Beyond this site, this is the engagement I run for other people's projects: review, verify, file, build, gate, merge, with the human gates placed wherever you need final say. If your codebase and your documentation have quietly drifted apart, or you want AI agents working on your project without handing them the keys, this loop is built for exactly that. Shoot me an email and tell me what you're working on: jake@jakelawrence.xyz.
