Looking at OpenAI Whisper? Try this first.

Drop your audio. Transcript in seconds. 30 free min, then $2 = 200 min

OpenAI Whisper

by OpenAI

The reference Whisper from OpenAI — weak-supervised multilingual ASR, MIT-licensed, 99 languages. Six model sizes plus the new large-v3-turbo. Pick a size, copy a recipe.

TL;DR

OpenAI's openai-whisper Python package is the canonical reference implementation of Whisper — a single large encoder-decoder transformer in PyTorch, weak-supervised on 680,000 hours of multilingual audio (paper: Radford et al., 2022). Ships six base checkpointstiny (39M), base (74M), small (244M), medium (769M), large (1.55B), and the newer turbo (809M, ~8× faster than large) — and supports 99 languages for transcription plus translation-to-English.

Best for research baselines, anyone learning the ASR space, and quick local experiments where dependency surface matters more than throughput. For production speed pick a derivative — faster-whisper (CTranslate2, ~4× faster) or whisper.cpp (C/C++, fastest on Apple Silicon and CPU). License: MIT on both code and weights. Newest checkpoint is large-v3-turbo — 4 decoder layers instead of 32, near-large-v3 quality at a fraction of the wall-clock.

Category
Open source
License
MIT
Stars
★ 98.1k
Last push
2026-04-15
Pricing
free
Platforms
Linux, macOS, Windows, GPU

What it is

Whisper is OpenAI's flagship speech-to-text model, trained on 680,000 hours of multilingual audio. The original Python package is the reference implementation — easy to install, but intentionally simple. For production speed you almost always want a derivative (whisper.cpp, faster-whisper, whisperX) that wraps the same weights in a faster runtime. Free, MIT-licensed, runs on CPU or GPU.

Best for: Research, baseline accuracy, teams that want the canonical reference implementation.
Watch out for: Pure-Python inference is slow on CPU; no built-in speaker diarization; batch-only, no streaming.

Install / use

pip install -U openai-whisper

Pick a model · 6 checkpoints

openai-whisper ships six base model sizes. Bigger = more accurate, more VRAM, slower. The table below mirrors the canonical README. Each card links the GitHub README anchor or the official Hugging Face model card — never a direct weights download. The Python package fetches weights on first use.

tiny
Smallest · ~10× speed · edge / smoke test

39M parameters. The fastest checkpoint and the easiest to run on a CPU laptop or edge box. WER trails the larger models substantially — fine for smoke tests, language ID, or short-clip preview. A separate tiny.en English-only variant ships alongside.

tiny
39M params · ~1 GB VRAM · ~10× relative speed vs large · multilingual + tiny.en variant
base
~7× speed · usable CPU default

74M parameters. The lowest checkpoint that produces broadly usable English transcripts on clean audio. Still fits inside ~1 GB VRAM and runs comfortably on consumer CPUs. A base.en English-only variant is also published.

base
74M params · ~1 GB VRAM · ~7× relative speed vs large · multilingual + base.en variant
small
~4× speed · CPU sweet spot

244M parameters. The point at which English WER on clean audio gets close to medium for most use cases. Comfortable on a 2 GB GPU or a modern CPU. A small.en English-only variant is published alongside.

small
244M params · ~2 GB VRAM · ~4× relative speed vs large · multilingual + small.en variant
medium
~2× speed · accuracy/VRAM balance

769M parameters. Strong multilingual accuracy on a 6–8 GB GPU. Common default for desktop transcription where large won't fit alongside the rest of the stack. A medium.en English-only variant is published alongside.

medium
769M params · ~5 GB VRAM · ~2× relative speed vs large · multilingual + medium.en variant
large-v3
Flagship accuracy · 99 languages

1.55B parameters. The full large encoder-decoder — best WER in the family and the baseline most papers compare against. Needs ~10 GB VRAM at fp16. Reference card: openai/whisper-large-v3 on Hugging Face.

large-v3
1.55B params · ~10 GB VRAM · 1× relative speed (baseline) · 99 languages · multilingual only
large-v3-turbo
Newer · ~8× speed at near-large quality

809M parameters. A 2024 release derived from large-v3 by collapsing the decoder from 32 layers to 4. Roughly 8× faster than large at near-equivalent transcription quality. Note: not trained for translation. Reference card: openai/whisper-large-v3-turbo on Hugging Face.

turbo
809M params · ~6 GB VRAM · ~8× relative speed · 99 languages · transcription only (no translate task)
Selecting at runtime: pass the model name as the first arg to whisper.load_model("turbo") or the --model CLI flag. Weights download from OpenAI's CDN on first use and cache under ~/.cache/whisper. For production throughput on the same weights see faster-whisper (CTranslate2) or whisper.cpp (C/C++).

Setup recipes · pick one and copy

Three runnable configurations from the openai/whisper README. Each block runs against the latest tagged release (v20250625 or later) and Python ≥3.9.

1Install + first transcript (CLI)

pip install · ffmpeg required · single-command transcription with SRT output. The default starting point.

# openai-whisper (latest, v20250625+) · Python >=3.9
# 1) ffmpeg is a hard runtime dep — install it first.
#    macOS:    brew install ffmpeg
#    Ubuntu:   sudo apt update && sudo apt install ffmpeg
#    Windows:  choco install ffmpeg     (or scoop install ffmpeg)

pip install -U openai-whisper

# 2) Transcribe a file directly to SRT subtitles.
whisper input.mp3 --model turbo --output_format srt

# Other useful flags:
#   --language Japanese        force a language (skip auto-detect)
#   --task translate           translate to English (not supported by 'turbo')
#   --device cuda              run on GPU
#   --output_format all        write txt + srt + vtt + tsv + json
2Python API + word timestamps

load_model() · transcribe() with word_timestamps=True · iterate segments and per-word timings.

# Python >=3.9, openai-whisper installed.
import whisper

model = whisper.load_model("turbo")   # or tiny / base / small / medium / large-v3

result = model.transcribe(
    "input.mp3",
    word_timestamps=True,
    # language="en",                 # skip auto-detect if you know it
    # task="translate",              # translate to English (not on 'turbo')
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text']}")
    for w in seg.get("words", []):
        print(f"   {w['start']:6.2f}-{w['end']:6.2f}  {w['word']!r}")
Source: openai/whisper · Python usage ↗. Word-timestamp accuracy drifts on long-form audio — for tighter per-word timing see WhisperX (forced alignment).
3GPU + half-precision

device="cuda" + fp16=True · the production knobs on the reference implementation. Cuts VRAM and wall-clock roughly in half on supported GPUs.

# CUDA toolkit + a matching torch build already installed.
import whisper

model = whisper.load_model("large-v3", device="cuda")

result = model.transcribe(
    "podcast.wav",
    fp16=True,                 # default True on CUDA; explicit here for clarity
    beam_size=5,
    best_of=5,
    temperature=0.0,
    # condition_on_previous_text=False,  # try if you see hallucination loops
)

print(result["text"][:500])
Reference: transcribe.py options ↗. For ~4× faster GPU inference on the same weights, switch to faster-whisper.

Features

Speaker diarizationNo
Word-level timestampsYes
Streaming / real-timeNo
Languages supported99
HIPAA eligibleNo

Links

OpenAI Whisper vs Whipscribe

FeatureOpenAI WhisperWhipscribe
CategoryOpen sourceTranscription APIs
Pricingfreefree beta
Speaker diarizationNoYes
Word timestampsYesYes
StreamingNoNo
Languages9999
PlatformsLinux, macOS, Windows, GPUWeb, API, MCP

Alternatives to OpenAI Whisper

Frequently asked about OpenAI Whisper

Is OpenAI Whisper free?

Yes. The openai-whisper package and all released model weights are MIT-licensed, free for commercial and non-commercial use. Inference cost is whatever hardware you run it on.

Does Whisper support speaker diarization?

No. Vanilla Whisper outputs text + segment timestamps but does not label speakers. To get 'who said what,' pair it with a diarization library (e.g. pyannote) or use whisperX, which bundles both.

What's the difference between Whisper and faster-whisper?

Same underlying model weights; different runtime. faster-whisper uses CTranslate2 and is roughly 4x faster on GPU with lower VRAM use. Accuracy is essentially identical. For production, faster-whisper is usually the better choice.

Can Whisper run on CPU?

Yes, but it's slow. Real-time factor on a modern laptop CPU with the large-v3 model is well below 1x. For CPU-bound workloads, whisper.cpp is dramatically faster than the reference Python implementation.

Does Whisper produce word-level timestamps?

The reference implementation has a word_timestamps flag, but timings can drift on long-form audio. For more accurate per-word timing, use whisperX (forced alignment) or stable-ts.

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.