OpenAI dropped Whisper as open source in 2022 and it quietly changed what local transcription could look like. Before that, you were either paying for a cloud API (Otter, Rev, Deepgram) or using open-source tools that were honestly pretty bad at accented speech, technical terms, or anything with background noise. Whisper sits in a completely different tier.
It runs on your machine. No API costs, no data leaving your system, no rate limits.
What you need before starting
Python 3.8 or higher. A video file. Either an NVIDIA GPU with CUDA support, or patience - CPU-only mode works fine for shorter clips, just not quickly.
If you have a GPU with 4GB+ of VRAM, you'll get 10-20x faster processing than CPU mode. For a 30-minute video, that's the difference between 3 minutes and an hour. Worth knowing before you kick off a long job and walk away expecting it to be done.
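Not sure which mode you'll get? Once Whisper is installed (next section), PyTorch - which it pulls in as a dependency - can tell you:

import torch

# Does PyTorch see a CUDA-capable GPU?
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected - Whisper will run in CPU mode")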
Installation
One pip command covers most of it:
pip install openai-whisper
Whisper also needs ffmpeg to extract audio from video files. Ubuntu/Debian: sudo apt install ffmpeg. Mac via Homebrew: brew install ffmpeg. Windows: download the binary and add it to your PATH.
The ffmpeg step is the one people always forget. The error it throws when ffmpeg is missing is vague enough that you'll spend 20 minutes looking in the wrong place. Just install it first.
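A two-line preflight check from Python saves that detour:

import shutil

# Whisper shells out to ffmpeg; fail loudly now rather than mid-job
if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found on PATH - install it first")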
Picking the right model size
Five sizes: tiny, base, small, medium, large. As you move up the list, speed goes down and accuracy goes up.
For most English content with clean audio, small is the sweet spot. Fast enough to be practical, accurate enough that you won't spend your time correcting obvious mistakes. Bump to medium for heavy accents, technical jargon, or videos with real background noise. large is for when accuracy matters above everything else - multilingual content, noisy live recordings, that kind of thing - and you have the compute to justify it.
tiny and base are fast but drop accuracy noticeably on anything outside clean, native-speaker English. Good for quick checks. Not for transcripts you'll actually use.
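If you're scripting model selection, one rough heuristic is to take the largest model that fits your VRAM. The thresholds below are ballpark figures based on the approximate requirements in Whisper's README (~5GB for medium, ~10GB for large) - a sketch, not a guarantee:

import torch

def pick_model() -> str:
    # CPU-only: anything above small gets painfully slow
    if not torch.cuda.is_available():
        return "small"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    return "small"

print(pick_model())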
Running it from the command line
This is genuinely all you need:
whisper your_video.mp4 --model small
Whisper pulls the audio, auto-detects the language, and writes transcripts to the same directory in several formats, including a .txt with the raw transcript, a .srt for subtitle use, and a .json with full segment data including timestamps. All in one shot.
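If you only want one of those formats, the --output_format flag narrows it down:

whisper your_video.mp4 --model small --output_format srt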
Language detection is good - it runs on the first 30 seconds of audio, and fed a Spanish video without a language flag, it gets the call right nearly every time. Genuinely mixed-language content is harder - results vary. If you already know the language, specifying it explicitly gets you slightly better results:
whisper your_video.mp4 --model medium --language Spanish
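Jumping ahead to the Python API for a second: if you want to see what auto-detection decided without committing to a full transcription, Whisper exposes it directly. This mirrors the language-detection example in the project's own README:

import whisper

model = whisper.load_model("small")

# Detection runs on the first 30 seconds of audio
audio = whisper.load_audio("your_video.mp4")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))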
Using Whisper from Python
The CLI is fine for one-off jobs. The Python API is what you want for batch processing or wiring transcription into a larger pipeline:
import whisper

model = whisper.load_model("small")
result = model.transcribe("your_video.mp4")

print(result["text"])  # full transcript as a string

# Segment-by-segment with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
The result["segments"] list is where the real value is. Each segment has a start time, end time, and the transcribed text. That's what you need for subtitle generation, syncing text to video, or time-coded search and analysis.
Long videos and faster alternatives
Whisper chunks audio internally into 30-second windows, so long videos aren't a problem technically. Just time. A 60-minute video on a mid-range GPU with the small model takes roughly 6-10 minutes. CPU-only, budget 45-90 minutes for the same file.
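One habit that pays off on batch jobs: load the model once and loop over files instead of invoking the CLI per video. A sketch, assuming a videos/ folder of .mp4 files (the folder layout is just for illustration):

from pathlib import Path

import whisper

model = whisper.load_model("small")  # load once, reuse for every file

for video in sorted(Path("videos").glob("*.mp4")):
    result = model.transcribe(str(video))
    video.with_suffix(".txt").write_text(result["text"], encoding="utf-8")
    print(f"{video.name}: {len(result['segments'])} segments")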
Processing lots of long videos? Look at faster-whisper. It's a reimplementation using CTranslate2 that runs 2-4x faster than the original with the same accuracy, and the Python API is close to the original's:
pip install faster-whisper
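"Close" rather than identical: transcribe returns a lazy generator of segment objects plus an info object, not a dict. A minimal sketch:

from faster_whisper import WhisperModel

# compute_type="int8" is a good choice on CPU-only machines
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("your_video.mp4")
print("Detected language:", info.language)

# segments is a generator - transcription happens as you iterate
for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")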
Adding subtitles to downloaded videos
If you've downloaded a video and want to add subtitles - for translation, accessibility, or repurposing content - Whisper's SRT output is your starting point. Run it, get the .srt, edit timing in Aegisub (free, solid), and you're done. Or feed it into a translation step.
There's an underused flag worth knowing: --task translate doesn't just transcribe - it translates directly to English from any supported language in one pass:
whisper your_video.mp4 --model medium --task translate
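The same option is available from Python as the task parameter on transcribe:

import whisper

model = whisper.load_model("medium")

# task="translate" transcribes and translates to English in one pass
result = model.transcribe("your_video.mp4", task="translate")
print(result["text"])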
Quality is solid for most European languages, decent for Arabic, Hindi, and Japanese. Not perfect. Usable for personal reference or rough drafts you'll clean up anyway.
Where Whisper still falls short
Heavy background music under speech is a real problem. When music and voice share similar frequency ranges and the music is loud, Whisper gets confused and produces garbled output on those sections. Crosstalk between speakers causes it to merge or skip lines. Very fast speech - auctioneers, some rap styles, speed-talkers - gives inconsistent results even on the large model.
For clean talking-head content, interviews, tutorials, and lectures it's excellent. For music videos, live events with crowd noise, or anything recorded in a loud room on a phone - keep your expectations in check.
One last practical thing: always look at the very start and end of your transcript. Whisper occasionally hallucinates a sentence when there's a long silence before speech begins. Trim leading silence so speech starts near second zero and this happens much less often.
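If hallucinations persist, one transcribe parameter worth experimenting with is condition_on_previous_text - turning it off is a commonly used mitigation for repeated or invented lines, at some cost to consistency across segment boundaries. A sketch, reusing model from the Python section above:

result = model.transcribe(
    "your_video.mp4",
    # Don't feed each 30-second window the previous window's text;
    # this is the usual first knob when Whisper loops or invents lines
    condition_on_previous_text=False,
)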
Run it on something you actually need transcribed. The gap between what cloud transcription services cost over a year and what Whisper costs is large enough that the ten-minute setup pays for itself immediately.