OpenAI Whisper
The reference Whisper implementation from OpenAI: weakly supervised multilingual ASR, MIT-licensed, 99 languages. Six model sizes, including the newer large-v3-turbo. Pick a size, copy a recipe.
OpenAI's openai-whisper Python package is the canonical reference implementation of Whisper: a single large encoder-decoder transformer in PyTorch, trained with weak supervision on 680,000 hours of multilingual audio (paper: Radford et al., 2022). It ships six base checkpoints: tiny (39M), base (74M), small (244M), medium (769M), large (1.55B), and the newer turbo (809M, ~8× faster than large). It supports 99 languages for transcription, plus translation to English.
Best for research baselines, anyone learning the ASR space, and quick local experiments where dependency surface matters more than throughput. For production speed pick a derivative — faster-whisper (CTranslate2, ~4× faster) or whisper.cpp (C/C++, fastest on Apple Silicon and CPU). License: MIT on both code and weights. Newest checkpoint is large-v3-turbo — 4 decoder layers instead of 32, near-large-v3 quality at a fraction of the wall-clock.
What it is
Whisper is OpenAI's flagship speech-to-text model, trained on 680,000 hours of multilingual audio. The original Python package is the reference implementation — easy to install, but intentionally simple. For production speed you almost always want a derivative (whisper.cpp, faster-whisper, whisperX) that wraps the same weights in a faster runtime. Free, MIT-licensed, runs on CPU or GPU.
Watch out for: Pure-Python inference is slow on CPU; no built-in speaker diarization; batch-only, no streaming.
Install / use
pip install -U openai-whisper
Pick a model · 6 checkpoints
openai-whisper ships six base model sizes. Bigger = more accurate, more VRAM, slower. The table below mirrors the canonical README. Each card links the GitHub README anchor or the official Hugging Face model card — never a direct weights download. The Python package fetches weights on first use.
tiny
39M parameters. The fastest checkpoint and the easiest to run on a CPU laptop or edge box. WER trails the larger models substantially; fine for smoke tests, language ID, or short-clip preview. A separate tiny.en English-only variant ships alongside.
39M params · ~1 GB VRAM · ~10× relative speed vs large · multilingual + tiny.en variant
base
74M parameters. The lowest checkpoint that produces broadly usable English transcripts on clean audio. Still fits inside ~1 GB VRAM and runs comfortably on consumer CPUs. A base.en English-only variant is also published.
74M params · ~1 GB VRAM · ~7× relative speed vs large · multilingual + base.en variant
small
244M parameters. The point at which English WER on clean audio gets close to medium for most use cases. Comfortable on a 2 GB GPU or a modern CPU. A small.en English-only variant is published alongside.
244M params · ~2 GB VRAM · ~4× relative speed vs large · multilingual + small.en variant
medium
769M parameters. Strong multilingual accuracy on a 6–8 GB GPU. Common default for desktop transcription where large won't fit alongside the rest of the stack. A medium.en English-only variant is published alongside.
769M params · ~5 GB VRAM · ~2× relative speed vs large · multilingual + medium.en variant
large
1.55B parameters. The full large encoder-decoder: best WER in the family and the baseline most papers compare against. Needs ~10 GB VRAM at fp16. Reference card: openai/whisper-large-v3 on Hugging Face.
1.55B params · ~10 GB VRAM · 1× relative speed (baseline) · 99 languages · multilingual only
turbo
809M parameters. A 2024 release derived from large-v3 by collapsing the decoder from 32 layers to 4. Roughly 8× faster than large at near-equivalent transcription quality. Note: not trained for translation. Reference card: openai/whisper-large-v3-turbo on Hugging Face.
809M params · ~6 GB VRAM · ~8× relative speed · 99 languages · transcription only (no translate task)
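The size trade-off above can be turned into a small lookup. The sketch below is illustrative only: the `Checkpoint` type and `pick_checkpoint` helper are hypothetical names, not part of openai-whisper, and the VRAM and speed figures are the approximate values from the cards above. It assumes "largest parameter count that fits" is a reasonable proxy for "most accurate that fits".

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    name: str
    params_millions: int
    vram_gb: int          # approximate, from the cards above
    relative_speed: int   # approximate multiple of 'large' speed
    translates: bool      # False for turbo (transcription only)


CHECKPOINTS = [
    Checkpoint("tiny", 39, 1, 10, True),
    Checkpoint("base", 74, 1, 7, True),
    Checkpoint("small", 244, 2, 4, True),
    Checkpoint("medium", 769, 5, 2, True),
    Checkpoint("turbo", 809, 6, 8, False),
    Checkpoint("large", 1550, 10, 1, True),
]


def pick_checkpoint(vram_gb: float, need_translate: bool = False) -> str:
    """Return the name of the largest checkpoint that fits the VRAM budget."""
    fitting = [
        c for c in CHECKPOINTS
        if c.vram_gb <= vram_gb and (c.translates or not need_translate)
    ]
    if not fitting:
        raise ValueError(f"no checkpoint fits in {vram_gb} GB")
    # Largest parameter count is used as a rough stand-in for best accuracy.
    return max(fitting, key=lambda c: c.params_millions).name


print(pick_checkpoint(8))                       # turbo
print(pick_checkpoint(8, need_translate=True))  # medium
```

The returned name is what you would pass to the model loader or the CLI's model flag. Treat the thresholds as starting points; actual memory use depends on audio length, decoding options, and the rest of your stack.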
Load any checkpoint by name with whisper.load_model("turbo") in Python or the --model flag on the CLI. Weights download from OpenAI's CDN on first use and cache under ~/.cache/whisper. For production throughput on the same weights see faster-whisper (CTranslate2) or whisper.cpp (C/C++).
Setup recipes · pick one and copy
Three runnable configurations from the openai/whisper README. Each block runs against the latest tagged release (v20250625 or later) and Python ≥3.9.
pip install · ffmpeg required · single-command transcription with SRT output. The default starting point.
# openai-whisper (latest, v20250625+) · Python >=3.9
# 1) ffmpeg is a hard runtime dep — install it first.
# macOS: brew install ffmpeg
# Ubuntu: sudo apt update && sudo apt install ffmpeg
# Windows: choco install ffmpeg (or scoop install ffmpeg)
pip install -U openai-whisper
# 2) Transcribe a file directly to SRT subtitles.
whisper input.mp3 --model turbo --output_format srt
# Other useful flags:
#   --language Japanese    force a language (skip auto-detect)
#   --task translate       translate to English (not supported by 'turbo')
#   --device cuda          run on GPU
#   --output_format all    write txt + srt + vtt + tsv + json
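If you drive the CLI from a script, it can help to assemble the command in one place. The helper below is a hypothetical sketch, not part of openai-whisper: `build_whisper_command` is an invented name, and it only uses the flags listed in the recipe above. It also rejects the turbo-plus-translate combination, since turbo is transcription only.

```python
import shlex


def build_whisper_command(audio_path, model="turbo", output_format="srt",
                          language=None, task="transcribe", device=None):
    """Build an argv list for the whisper CLI from the flags shown above."""
    if task == "translate" and model == "turbo":
        raise ValueError("'turbo' is not trained for translation; pick another checkpoint")
    cmd = ["whisper", audio_path, "--model", model, "--output_format", output_format]
    if language:
        cmd += ["--language", language]
    if task != "transcribe":
        cmd += ["--task", task]
    if device:
        cmd += ["--device", device]
    return cmd


cmd = build_whisper_command("input.mp3")
print(shlex.join(cmd))  # whisper input.mp3 --model turbo --output_format srt
# To actually run it: subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems with file names that contain spaces.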
load_model() · transcribe() with word_timestamps=True · iterate segments and per-word timings.
# Python >=3.9, openai-whisper installed.
import whisper
model = whisper.load_model("turbo") # or tiny / base / small / medium / large-v3
result = model.transcribe(
    "input.mp3",
    word_timestamps=True,
    # language="en",       # skip auto-detect if you know it
    # task="translate",    # translate to English (not on 'turbo')
)
print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text']}")
    for w in seg.get("words", []):
        print(f"    {w['start']:6.2f}-{w['end']:6.2f} {w['word']!r}")
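Once you have segments in Python, you may want subtitles without going back to the CLI. The sketch below converts a list of segment dicts (with the `start`, `end`, and `text` keys used above) into SRT text. The function names are hypothetical helpers written for this example, not functions from the whisper package.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segment dicts as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


demo = [
    {"start": 0.0, "end": 1.25, "text": " Hello there."},
    {"start": 1.25, "end": 3.0, "text": " General greeting."},
]
print(segments_to_srt(demo))
```

With a real transcription you would pass `result["segments"]` instead of the `demo` list and write the returned string to a `.srt` file.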
device="cuda" + fp16=True · the production knobs on the reference implementation. Compared with fp32, fp16 roughly halves VRAM and can substantially cut wall-clock on GPUs with fast half-precision support.
# CUDA toolkit + a matching torch build already installed.
import whisper
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe(
    "podcast.wav",
    fp16=True,         # default True on CUDA; explicit here for clarity
    beam_size=5,       # beam search width, used at temperature 0
    best_of=5,         # only applies when sampling (temperature > 0); ignored here
    temperature=0.0,
    # condition_on_previous_text=False,  # try if you see hallucination loops
)
print(result["text"][:500])
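As a rough sanity check on the fp16 savings, weight memory is just parameter count times bytes per parameter. The sketch below uses the parameter counts quoted on this page; `weights_gb` is a hypothetical helper, and the result covers weights only. The per-model VRAM figures quoted earlier are higher, presumably because inference also needs activations, decoder caches, and framework overhead.

```python
# Parameter counts as quoted on this page.
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "turbo": 809e6,
    "large": 1.55e9,
}
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gb(model, precision):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS[model] * BYTES_PER_PARAM[precision] / 1e9


for name in PARAMS:
    print(f"{name:>6}: fp16 ~{weights_gb(name, 'fp16'):.1f} GB, "
          f"fp32 ~{weights_gb(name, 'fp32'):.1f} GB")
# large works out to about 3.1 GB at fp16 and 6.2 GB at fp32.
```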
Features
| Feature | OpenAI Whisper |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- openai/whisper: main repo · canonical README, model table, CLI + Python usage · MIT-licensed code and weights
- Whisper paper · arXiv 2212.04356: Radford et al. 2022 · Robust Speech Recognition via Large-Scale Weak Supervision · 680K hours of multilingual audio
- huggingface.co/openai/whisper-large-v3: official large-v3 model card · 1.55B params · 99 languages · usage with HF transformers pipeline
- huggingface.co/openai/whisper-large-v3-turbo: official turbo model card · 809M params · 4 decoder layers vs 32 in large-v3 · transcription-only
- openai/whisper · Discussions: active community Q&A · pinned threads cover streaming, long-form alignment, language tips
- HF transformers · Whisper docs: alternative runtime · pipeline('automatic-speech-recognition', model='openai/whisper-large-v3-turbo') · supports Flash Attention 2 + torch.compile
- SYSTRAN/faster-whisper: CTranslate2 reimplementation of the same weights · ~4× faster on GPU · the production swap-in for openai-whisper
- ggml-org/whisper.cpp: C/C++ port · zero Python deps · fastest CPU and Apple Silicon path on the same Whisper weights
OpenAI Whisper vs Whipscribe
| Feature | OpenAI Whisper | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | Linux, macOS, Windows, GPU | Web, API, MCP |
Frequently asked about OpenAI Whisper
Is OpenAI Whisper free?
Yes. The openai-whisper package and all released model weights are MIT-licensed, free for commercial and non-commercial use. Inference cost is whatever hardware you run it on.
Does Whisper support speaker diarization?
No. Vanilla Whisper outputs text + segment timestamps but does not label speakers. To get 'who said what,' pair it with a diarization library (e.g. pyannote) or use whisperX, which bundles both.
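The pairing step is conceptually simple: give each transcript segment the speaker whose turn overlaps it most. The sketch below assumes the diarizer's output has already been converted to plain `(start, end, speaker)` tuples. That tuple format and the function names are assumptions for illustration, not the pyannote or whisperX API.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most.

    segments: dicts with 'start', 'end', 'text' (as in Whisper's output).
    turns: (start, end, speaker) tuples; a hypothetical diarizer format.
    Segments with no overlapping turn get speaker None.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


segments = [
    {"start": 0.0, "end": 2.0, "text": "Hi, thanks for joining."},
    {"start": 2.0, "end": 5.0, "text": "Happy to be here."},
]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 6.0, "SPEAKER_01")]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Segment-level assignment mislabels segments that span a speaker change; tools that align at the word level handle that case better.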
What's the difference between Whisper and faster-whisper?
Same underlying model weights; different runtime. faster-whisper uses CTranslate2 and is roughly 4x faster on GPU with lower VRAM use. Accuracy is essentially identical. For production, faster-whisper is usually the better choice.
Can Whisper run on CPU?
Yes, but it's slow. With the large-v3 model on a modern laptop CPU, the reference implementation runs well below real-time speed, meaning a file takes longer to transcribe than it takes to play. For CPU-bound workloads, whisper.cpp is dramatically faster than the reference Python implementation.
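To put a number on "slow" for your own machine, time a run and compare it with the audio length. The sketch below uses two hypothetical helpers: `speed_multiple` is audio duration divided by processing time (below 1x means slower than real time), and `timed` wraps any callable with a wall-clock timer.

```python
import time


def speed_multiple(processing_seconds, audio_seconds):
    """Audio duration divided by processing time; below 1.0 is slower than real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: a 60-second clip that took 150 seconds to transcribe.
print(f"{speed_multiple(150, 60):.2f}x real time")  # 0.40x real time
```

With a loaded model you would call something like `result, elapsed = timed(model.transcribe, "input.mp3")` and pass `elapsed` together with the clip's duration to `speed_multiple`.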
Does Whisper produce word-level timestamps?
The reference implementation has a word_timestamps flag, but timings can drift on long-form audio. For more accurate per-word timing, use whisperX (forced alignment) or stable-ts.
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.