Automating podcast transcripts on my Mac with OpenAI Whisper
A while ago, David Smith created a site called Podsearch, a search engine for a few of his favorite podcasts, including a couple of mine. That project went by the wayside after a while, and I found myself getting frustrated during episodes of Upgrade that I couldn’t refer people back to specific episodes where we had already discussed a topic.
About the same time, I began reading about OpenAI Whisper, an automatic speech recognition system that “approaches human level robustness and accuracy” for converting the spoken word into written text. Up until then, I’d been doing speech-to-text, most notably for my transcripts of Apple results calls, using various services (Trint, Rev) that charge by the minute.
Whisper’s free, and you can run it on your own computer. I thought that I might give Whisper a go in transcribing Upgrade—or at least recent episodes of Upgrade, maybe since episode 400—for my own reference.
I rapidly discovered that while the Python implementation of Whisper would run on my Mac, it ran at about 0.5x speed, so a two-hour podcast would take four hours to transcribe. Not great. Still, the results were promising. Here’s the state of the art of podcast transcription circa 2017:
Alright we’re going to wrap it up that this ends this edition of our red chickens with Batman that are affiliated with like extension cords for Batman University I’d like to think my gas for being here and watching some Batman movies with me… and told her I think you were the king of the Wicker people. Goodnight everybody for listening to be uncomfortable I’ve been your Hostess and smell but really I Batman.
And here’s how Whisper fared:
All right, we’re gonna wrap it up. This ends this edition of our check-ins with Batman that are affiliated. It’s like extension course for Batman University. I’d like to thank my guests for being here and watching some Batman movies with me…. And Tony Sindelar, I think you were the king of the Wicker people. Goodbye nerds. And thanks everybody for listening to The Incomparable. I’ve been your host Jason Snell. But really, I’m Batman. Hmm.
While not perfect, Whisper was staggeringly better than the 2017 transcript and really, much better than any other AI-driven transcription I’d tried recently. It got the punctuation. It got proper names. And it didn’t turn “Thanks for listening to The Incomparable, I’ve been your host Jason Snell” into “Goodnight everybody for listening to be uncomfortable, I’ve been your Hostess and smell.”
Fortunately, a fellow named Georgi Gerganov made a C++-native port of Whisper that is easy to install and run on macOS and is optimized for Apple silicon. I downloaded and installed Gerganov’s version, downloaded the medium English model, and discovered that it could transcribe a podcast at rates up to 2x!
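To put those speed figures in perspective, here is a quick back-of-the-envelope sketch in Python. The function name and the two-hour episode length are illustrative choices only; the 0.5x and 2x factors are the ones mentioned above.

```python
def transcription_hours(audio_hours: float, speed_factor: float) -> float:
    """Wall-clock hours needed to transcribe audio at a given multiple of real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_hours / speed_factor


episode_hours = 2.0  # illustrative: a two-hour podcast

for label, speed in [("Python Whisper, about 0.5x", 0.5), ("C++ port, up to 2x", 2.0)]:
    hours = transcription_hours(episode_hours, speed)
    print(f"{label}: {hours:.1f} hours")
```

At the best-case rate, the same two-hour episode drops from four hours of processing to about one.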
This was great, but the last thing I needed was to have to remember all the arcane command-line commands required to get the files in the right place. So instead, I wrote The Transcriptor, a Shortcut that lets me control-click on audio files and turn them into transcripts in a format of my choice. (I also pointed Whisper at an episode of Total Party Kill and it made a remarkably good subtitle track ready for uploading to YouTube.)
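For a rough idea of what a wrapper like this has to hide, here is a minimal Python sketch. It is not the actual Transcriptor shortcut. It assumes ffmpeg is installed, that it runs from inside a built whisper.cpp folder whose command-line binary is named ./main (newer builds may name it differently), and that the medium English model sits at models/ggml-medium.en.bin. The function name and file names are hypothetical.

```python
import shlex
from pathlib import Path

# Assumptions, not taken from the article: adjust these to your own setup.
WHISPER_BIN = "./main"                # whisper.cpp command-line binary
MODEL = "models/ggml-medium.en.bin"   # medium English model file

# whisper.cpp output switches for plain text, SRT subtitles and WebVTT.
OUTPUT_FLAGS = {"txt": "-otxt", "srt": "-osrt", "vtt": "-ovtt"}


def build_commands(audio: Path, fmt: str = "txt") -> list[list[str]]:
    """Return the two commands needed to turn one audio file into a transcript."""
    if fmt not in OUTPUT_FLAGS:
        raise ValueError(f"unsupported format: {fmt}")

    # whisper.cpp expects 16 kHz, mono, 16-bit WAV input, so convert first.
    # A separate file name avoids overwriting an input that is already a WAV.
    wav = audio.with_name(audio.stem + ".16k.wav")
    convert = [
        "ffmpeg", "-y", "-i", str(audio),
        "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav),
    ]

    # -of sets the output path without an extension, e.g. episode -> episode.srt
    transcribe = [
        WHISPER_BIN, "-m", MODEL, "-f", str(wav),
        OUTPUT_FLAGS[fmt], "-of", str(audio.with_suffix("")),
    ]
    return [convert, transcribe]


# Print the commands rather than running them, so nothing happens by surprise.
for command in build_commands(Path("episode.mp3"), "srt"):
    print(shlex.join(command))
```

To actually run the steps, each command list could be passed to subprocess.run(command, check=True). Keeping the command construction separate from execution makes it easy to inspect what will run before committing a few hours of CPU time to it.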
Along the way I mentioned what I was doing to David Smith, who sent me his code for Podsearch so I could use it to generate my Upgrade archive. This apparently turned David on to Whisper, and he’s since revived the site with Whisper-derived transcripts of seven podcasts, including Upgrade.
Then last week, Apple’s financial results came out. Rather than using Rev, which I had been using to generate and correct transcripts the past few years, I decided to use Whisper and The Transcriptor to do the job.
Other than a few hiccups involving using separate tools to record, transcribe, edit, and play back audio—I need to figure out a more complete workflow there—it worked spectacularly well. Over the years I’ve internalized all the Apple financial analyst call-specific phrases that the AI engine used by Rev would get wrong, which I’d need to correct. Almost all of them were rendered correctly by Whisper! I had to do less to get the transcript in good shape than I ever have before.
This is not to say that web apps like Rev are standing still. They’re always seeking better speech-to-text systems, and might even adopt Whisper themselves. And those services add other nice features, like the integration of audio playback and text editing, that definitely make editing a transcript easier than what I did. (I was editing in BBEdit and clicking into Overcast, which was playing back uploaded MP3 files at 1.5x speed, whenever I needed to pause or back up.)
Still… this is amazing. If I have learned anything from this journey, it’s that the ability to generate high-quality, readable transcripts from podcast audio is going to be here soon. It’s not quite here yet—Whisper has quirks that make it better for searchable transcripts than actual reading, and it doesn’t identify speakers—but it’s perilously close now.
While reading a podcast transcript will never be the same as listening to the podcast, providing usable transcripts will make podcast content more accessible, searchable, and able to be referenced. It’s all just around the corner now.