tools · July 4, 2026 · 4 min read

I built a podcast production line that pauses exactly once

A voice memo goes in; an edited, captioned, rule-checked episode waits at a human gate. A build log on the new episode production line: an AI editor hired onto a one-episode contract, a free ffmpeg-and-Whisper floor standing behind it, two QA rungs, and a trust posture that never believes an unsigned callback. Plus what a line like this is for, and when you should not build one.

Publishing a podcast episode by hand is about eleven chores pretending to be one. Record, level the audio, cut the filler words, transcribe, caption, run the checks, upload, stamp the feed, cut the promo clips. This week I finished a production line that does the middle of that list from a single voice memo, and it pauses exactly once: to ask a human whether the episode is any good. Here is how it works, what it costs to run, and how to tell whether you need one.

Legend: machine step, AI pass, human gate, not built yet.

The line, with the site's usual notation: squares are system steps, the circle is the human gate. Blue steps are deterministic machine work, the yellow step is the AI edit pass, and the dashed stops are phase 3, not built yet.

Eight stops, one pause

The first stop takes whatever the phone recorded and makes it broadcast-shaped. It validates the file, then normalizes loudness to the standard podcast window; my 26-minute test recording landed at −19.45 LUFS against a −19 target, which is the kind of number an app like Overcast expects and a laptop microphone never produces on its own.
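As a rough sketch of this stop, assuming ffmpeg's loudnorm filter does the leveling: the −19 LUFS target comes from the run above, while the true-peak and loudness-range values, the 1 LU tolerance, the file names, and the function names are illustrative assumptions rather than the pipeline's actual settings.

```python
def loudnorm_command(src: str, dst: str, target_lufs: float = -19.0) -> list[str]:
    """Build an ffmpeg command that levels audio with the loudnorm filter."""
    # TP (true peak) and LRA (loudness range) are assumed values, not from the post.
    audio_filter = f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


def within_window(measured_lufs: float, target_lufs: float = -19.0,
                  tolerance_lu: float = 1.0) -> bool:
    """True when the measured integrated loudness sits inside the window."""
    # The tolerance is a placeholder; the post does not state the window's width.
    return abs(measured_lufs - target_lufs) <= tolerance_lu


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(loudnorm_command("memo.m4a", "leveled.wav")))
    print(within_window(-19.45))  # the test recording's reading from the post
```

Keeping the command builder separate from the window check means the same check can be reused later, when the encoded file is measured again.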

The second stop is the interesting one: an AI editor removes the ums, cleans up the room tone, and tightens any silence longer than two seconds. More on its employment terms below.

Captions come from Whisper, a speech-to-text model that runs on my own build machines. A 30-minute episode transcribes in about seven minutes of free CI time, and the result becomes the caption files players use, not just a transcript in a drawer.
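Turning a transcript into caption files is mostly timestamp formatting. A minimal sketch, assuming segments shaped like the ones the open-source Whisper package returns (dicts with start, end, and text, times in seconds) and assuming WebVTT as the caption format; the post does not say which format the line emits, and the function names are hypothetical.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments: list[dict]) -> str:
    """Render transcript segments as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [{"start": 0.0, "end": 2.5, "text": " Hello there. "}]
    print(segments_to_vtt(demo))
```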

Then two layers of quality assurance. A deterministic layer re-measures the encoded audio (spec, loudness window, caption coverage) and fails loudly on any miss. A judgment layer has a language model read the full transcript against my published-content rules and the episode brief, and it is built to overblock: anything malformed or borderline parks the episode rather than passing it.
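The overblocking rule is easiest to see as a parser that fails closed. This is a sketch under assumptions: the JSON reply shape (a verdict string plus a violations list) and the function name are hypothetical, not the format the real judgment layer uses.

```python
import json


def judge_verdict(raw_reply) -> str:
    """Return "pass" only for a clean, well-formed pass; otherwise "park"."""
    try:
        reply = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        return "park"  # malformed output never passes
    if not isinstance(reply, dict):
        return "park"
    # Require an explicit pass AND an explicit, empty violations list.
    if reply.get("verdict") == "pass" and reply.get("violations") == []:
        return "pass"
    return "park"  # borderline, unknown, or missing fields all park


if __name__ == "__main__":
    print(judge_verdict('{"verdict": "pass", "violations": []}'))
    print(judge_verdict("the episode looks fine to me"))
```

The design choice is that "pass" is the only outcome that has to be earned; every other path, including a crash in parsing, lands on "park".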

Only then does a person enter. I listen to the edited cut against the raw one and approve or reject. Nothing in this line publishes anything on its own; the gate is the product, not a compliance sticker.

The editor is on a one-episode contract

The edit pass runs on Descript, a hosted AI audio editor, at $24 a month. Before spending a dollar I built the fallback: an editing floor made of ffmpeg and Whisper that costs nothing and will run forever. The vendor only earns its seat if the edited cut beats the raw cut at the human gate on a real episode. If it loses, the subscription dies and the floor takes over the stop. Same shape either way, one swapped part.

The edit happens in a vendor's cloud, but the recipe that made it lives in my repo.

That recipe is a versioned prompt committed next to the code: remove filler, apply studio sound, tighten silences, and an explicit prohibition on rewriting, reordering, or synthesizing any speech. The editor tightens the episode; it does not get to change what I said. Every job also reports its own meter reading (media minutes and AI credits consumed), which the pipeline writes into a per-episode manifest, so the whole lane stays inside a $50-a-month ceiling I can audit from the repo.
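A per-episode manifest and the ceiling check could look something like this. The field names, the idea of a separate metered-overage amount, and the way spend is summed are assumptions for illustration; only the $24 subscription and the $50 ceiling come from the text above.

```python
import json
from dataclasses import asdict, dataclass

MONTHLY_CEILING_USD = 50.00  # from the post
SUBSCRIPTION_USD = 24.00     # from the post


@dataclass
class EpisodeManifest:
    # All field names are hypothetical.
    episode_id: str
    recipe_version: str           # which committed prompt produced the edit
    media_minutes: float          # meter reading reported by the job
    ai_credits: int               # meter reading reported by the job
    extra_cost_usd: float = 0.0   # assumed: any metered overage for this episode

    def to_json(self) -> str:
        """Serialize with stable key order so diffs in the repo stay readable."""
        return json.dumps(asdict(self), sort_keys=True)


def month_spend(manifests: list[EpisodeManifest]) -> float:
    """Subscription plus any per-episode overage recorded in the manifests."""
    return SUBSCRIPTION_USD + sum(m.extra_cost_usd for m in manifests)


def under_ceiling(manifests: list[EpisodeManifest]) -> bool:
    return month_spend(manifests) <= MONTHLY_CEILING_USD


if __name__ == "__main__":
    trial = EpisodeManifest("ep-001", "v1", media_minutes=26.0, ai_credits=12)
    print(trial.to_json())
    print(under_ceiling([trial]))
```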

Phase 0 (PASS): Verify first. Read the docs, benchmark, price it. No code.

Phase 1 (PASS): The floor. ffmpeg + Whisper. Works with zero vendors, forever. $0/mo.

Phase 2 (GATE OPEN): The vendor trial. The AI editor, hired onto a one-episode contract. $24/mo.

Phase 3 (NEXT): Publish + clips. Feed enclosure live, three clips drafted behind gates. $0 extra.

New monthly spend: $24 of a $50 ceiling.

The yellow bar is the editor. If it fails the trial, the bar goes to zero and the line keeps running.

The build itself ran the same way the pipeline runs: each phase behind a gate. Phase 0 verified the vendor's docs and pricing before any code existed; phase 2's gate stays open until the trial episode renders a verdict.

Trust the vendor, not its doorbell

One design decision is worth pulling out, because it applies to any pipeline that waits on someone else's cloud. When the editor finishes a job it can ring you back with a callback, a small automatic message saying done. Descript's callbacks are unsigned, which means anyone who guesses the address could ring the same bell. So the pipeline treats a callback as a doorbell and nothing more: it may wake the runner up, but the only thing that completes a job is the pipeline calling the vendor's API itself, with its own credentials, and reading the answer.

dashed = anyone could have sent it; solid = authenticated, and the only path the pipeline acts on

The unsigned callback stops at the trust boundary. The authenticated round-trip is the only path the pipeline acts on.
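In code, the doorbell rule comes down to never reading status from the callback itself. A minimal sketch, assuming a payload that carries a job id, a set of job ids this pipeline submitted, and an injected fetch_status function standing in for the authenticated API call; all of those names, and the "done" status string, are hypothetical rather than Descript's actual API.

```python
from typing import Callable


def handle_callback(payload: dict, known_jobs: set[str],
                    fetch_status: Callable[[str], str]) -> bool:
    """Treat a callback as a wake-up only; True means the job is verified done."""
    job_id = payload.get("job_id")
    # Ignore rings for jobs this pipeline never submitted, or malformed ids.
    if not isinstance(job_id, str) or job_id not in known_jobs:
        return False
    # Anything the payload claims about status is deliberately never read.
    # Only the authenticated round-trip decides whether the job is complete.
    return fetch_status(job_id) == "done"


if __name__ == "__main__":
    submitted = {"job-1"}
    forged = {"job_id": "job-1", "status": "done"}
    # The vendor still reports "processing", so the forged ring completes nothing.
    print(handle_callback(forged, submitted, lambda job_id: "processing"))
```

Passing the status fetcher in as a parameter keeps the trust boundary visible: the handler cannot complete a job without making the authenticated call.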

What a line like this is for

Strip away the podcast specifics and the pattern is: a recurring recording goes in, a checked, captioned, publishable artifact comes out, and a human approves each unit. That fits more workflows than mine. A weekly show, obviously. Course lectures becoming captioned modules. A church or meetup publishing every talk without volunteering someone into an editing job. Client interviews turning into searchable, quotable transcripts. Internal briefings that people can actually listen to. The stops change; the shape (machine steps, one AI pass on a fixed recipe, one human gate) carries over.

What I am using it for

This line exists to feed The Legibility Desk, my critique podcast, where every episode is already a cited, verifiable page and the audio has been the missing half. It sits alongside Between Systems, the interview show, and everything it produces eventually joins the site radio, the continuous player over this site's own narrated audio. The pipeline itself follows the same house pattern as the workflow engine I wrote about in June: AI does the volume, gates do the judgment, and the human stays load-bearing.

When I would not build this

If you publish occasionally rather than on a cadence, skip all of it and edit by hand in any decent app; a production line pays for itself through repetition or not at all. If you do not already have somewhere for files to live and jobs to run, a hosted end-to-end tool is the honest choice, because the storage and the runners are most of the plumbing. And if your show is multi-speaker interviews, this exact line rejects your files on purpose; separating voices is a different problem, and mine is deferred until the solo line has earned its keep.

If your team has a pile of recordings and a publishing chore, this is a thing I build. Email me at jake@jakelawrence.xyz with what goes in and what should come out, and I will tell you honestly whether you need a pipeline or an afternoon of ffmpeg.

