Looking at OpenAI Whisper? Try this first.

Drop your audio. Transcript in seconds. 30 free min, then $2 = 200 min

OpenAI Whisper

Name: OpenAI Whisper
Author: OpenAI

by OpenAI

The reference Whisper from OpenAI — weak-supervised multilingual ASR, MIT-licensed, 99 languages. Six model sizes plus the new large-v3-turbo. Pick a size, copy a recipe.

TL;DR

OpenAI's openai-whisper Python package is the canonical reference implementation of Whisper — a single large encoder-decoder transformer in PyTorch, weak-supervised on 680,000 hours of multilingual audio (paper: Radford et al., 2022). Ships six base checkpoints — tiny (39M), base (74M), small (244M), medium (769M), large (1.55B), and the newer turbo (809M, ~8× faster than large) — and supports 99 languages for transcription plus translation-to-English.

Best for research baselines, anyone learning the ASR space, and quick local experiments where dependency surface matters more than throughput. For production speed pick a derivative — faster-whisper (CTranslate2, ~4× faster) or whisper.cpp (C/C++, fastest on Apple Silicon and CPU). License: MIT on both code and weights. Newest checkpoint is large-v3-turbo — 4 decoder layers instead of 32, near-large-v3 quality at a fraction of the wall-clock.

What it is

Whisper is OpenAI's flagship speech-to-text model, trained on 680,000 hours of multilingual audio. The original Python package is the reference implementation — easy to install, but intentionally simple. For production speed you almost always want a derivative (whisper.cpp, faster-whisper, whisperX) that wraps the same weights in a faster runtime. Free, MIT-licensed, runs on CPU or GPU.

Best for: Research, baseline accuracy, teams that want the canonical reference implementation.
Watch out for: Pure-Python inference is slow on CPU; no built-in speaker diarization; batch-only, no streaming.

Install / use

pip install -U openai-whisper

Pick a model · 6 checkpoints

openai-whisper ships six base model sizes. Bigger = more accurate, more VRAM, slower. The table below mirrors the canonical README. Each card links the GitHub README anchor or the official Hugging Face model card — never a direct weights download. The Python package fetches weights on first use.

tiny

Smallest · ~10× speed · edge / smoke test

39M parameters. The fastest checkpoint and the easiest to run on a CPU laptop or edge box. WER trails the larger models substantially — fine for smoke tests, language ID, or short-clip preview. A separate tiny.en English-only variant ships alongside.

tiny
39M params · ~1 GB VRAM · ~10× relative speed vs large · multilingual + tiny.en variant

base

~7× speed · usable CPU default

74M parameters. The lowest checkpoint that produces broadly usable English transcripts on clean audio. Still fits inside ~1 GB VRAM and runs comfortably on consumer CPUs. A base.en English-only variant is also published.

base
74M params · ~1 GB VRAM · ~7× relative speed vs large · multilingual + base.en variant

small

~4× speed · CPU sweet spot

244M parameters. The point at which English WER on clean audio gets close to medium for most use cases. Comfortable on a 2 GB GPU or a modern CPU. A small.en English-only variant is published alongside.

small
244M params · ~2 GB VRAM · ~4× relative speed vs large · multilingual + small.en variant

medium

~2× speed · accuracy/VRAM balance

769M parameters. Strong multilingual accuracy on a 6–8 GB GPU. Common default for desktop transcription where large won't fit alongside the rest of the stack. A medium.en English-only variant is published alongside.

medium
769M params · ~5 GB VRAM · ~2× relative speed vs large · multilingual + medium.en variant

large-v3

Flagship accuracy · 99 languages

1.55B parameters. The full large encoder-decoder — best WER in the family and the baseline most papers compare against. Needs ~10 GB VRAM at fp16. Reference card: openai/whisper-large-v3 on Hugging Face.

large-v3
1.55B params · ~10 GB VRAM · 1× relative speed (baseline) · 99 languages · multilingual only

large-v3-turbo

Newer · ~8× speed at near-large quality

809M parameters. A 2024 release derived from large-v3 by collapsing the decoder from 32 layers to 4. Roughly 8× faster than large at near-equivalent transcription quality. Note: not trained for translation. Reference card: openai/whisper-large-v3-turbo on Hugging Face.

turbo
809M params · ~6 GB VRAM · ~8× relative speed · 99 languages · transcription only (no translate task)

Selecting at runtime: pass the model name as the first arg to whisper.load_model("turbo") or the --model CLI flag. Weights download from OpenAI's CDN on first use and cache under ~/.cache/whisper. For production throughput on the same weights see faster-whisper (CTranslate2) or whisper.cpp (C/C++).

Setup recipes · pick one and copy

Three runnable configurations from the openai/whisper README. Each block runs against the latest tagged release (v20250625 or later) and Python ≥3.9.

1Install + first transcript (CLI)

pip install · ffmpeg required · single-command transcription with SRT output. The default starting point.

# openai-whisper (latest, v20250625+) · Python >=3.9
# 1) ffmpeg is a hard runtime dep — install it first.
#    macOS:    brew install ffmpeg
#    Ubuntu:   sudo apt update && sudo apt install ffmpeg
#    Windows:  choco install ffmpeg     (or scoop install ffmpeg)

pip install -U openai-whisper

# 2) Transcribe a file directly to SRT subtitles.
whisper input.mp3 --model turbo --output_format srt

# Other useful flags:
#   --language Japanese        force a language (skip auto-detect)
#   --task translate           translate to English (not supported by 'turbo')
#   --device cuda              run on GPU
#   --output_format all        write txt + srt + vtt + tsv + json

Source: openai/whisper · Command-line usage ↗. PyPI: openai-whisper ↗

2Python API + word timestamps

load_model() · transcribe() with word_timestamps=True · iterate segments and per-word timings.

# Python >=3.9, openai-whisper installed.
import whisper

model = whisper.load_model("turbo")   # or tiny / base / small / medium / large-v3

result = model.transcribe(
    "input.mp3",
    word_timestamps=True,
    # language="en",                 # skip auto-detect if you know it
    # task="translate",              # translate to English (not on 'turbo')
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text']}")
    for w in seg.get("words", []):
        print(f"   {w['start']:6.2f}-{w['end']:6.2f}  {w['word']!r}")

Source: openai/whisper · Python usage ↗. Word-timestamp accuracy drifts on long-form audio — for tighter per-word timing see WhisperX (forced alignment).

3GPU + half-precision

device="cuda" + fp16=True · the production knobs on the reference implementation. Cuts VRAM and wall-clock roughly in half on supported GPUs.

# CUDA toolkit + a matching torch build already installed.
import whisper

model = whisper.load_model("large-v3", device="cuda")

result = model.transcribe(
    "podcast.wav",
    fp16=True,                 # default True on CUDA; explicit here for clarity
    beam_size=5,
    best_of=5,
    temperature=0.0,
    # condition_on_previous_text=False,  # try if you see hallucination loops
)

print(result["text"][:500])

Reference: transcribe.py options ↗. For ~4× faster GPU inference on the same weights, switch to faster-whisper.

Features

Speaker diarization	No
Word-level timestamps	Yes
Streaming / real-time	No
Languages supported	99
HIPAA eligible	No

OpenAI Whisper vs Whipscribe

Feature	OpenAI Whisper	Whipscribe
Category	Open source	Transcription APIs
Pricing	free	free beta
Speaker diarization	No	Yes
Word timestamps	Yes	Yes
Streaming	No	No
Languages	99	99
Platforms	Linux, macOS, Windows, GPU	Web, API, MCP

Alternatives to OpenAI Whisper

whisper.cpp

Georgi Gerganov

C/C++ port of Whisper — runs on anything, from a Raspberry Pi to Apple Silicon.

OSS · MIT ★ 48.8k

faster-whisper

SYSTRAN

4× faster than reference Whisper using CTranslate2 — production sweet spot.

OSS · MIT ★ 22.3k

whisperX

Max Bain

Faster-whisper + forced alignment + speaker diarization in one pipeline.

OSS · BSD‑2‑Clause ★ 21.4k

Frequently asked about OpenAI Whisper

Is OpenAI Whisper free?

Yes. The openai-whisper package and all released model weights are MIT-licensed, free for commercial and non-commercial use. Inference cost is whatever hardware you run it on.

Does Whisper support speaker diarization?

No. Vanilla Whisper outputs text + segment timestamps but does not label speakers. To get 'who said what,' pair it with a diarization library (e.g. pyannote) or use whisperX, which bundles both.

What's the difference between Whisper and faster-whisper?

Same underlying model weights; different runtime. faster-whisper uses CTranslate2 and is roughly 4x faster on GPU with lower VRAM use. Accuracy is essentially identical. For production, faster-whisper is usually the better choice.

Can Whisper run on CPU?

Yes, but it's slow. Real-time factor on a modern laptop CPU with the large-v3 model is well below 1x. For CPU-bound workloads, whisper.cpp is dramatically faster than the reference Python implementation.

Does Whisper produce word-level timestamps?

The reference implementation has a word_timestamps flag, but timings can drift on long-form audio. For more accurate per-word timing, use whisperX (forced alignment) or stable-ts.

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.