Drop your audio. Transcript in seconds. 30 free min, then $2 = 200 min
faster-whisper
Production-grade Whisper inference. CTranslate2 rebuild, ~4× faster than reference openai-whisper at the same WER, smaller VRAM, drop-in Python API. Pick a model, copy a recipe.
MIT-licensed reimplementation of OpenAI Whisper on the CTranslate2 engine. ~4× faster on a single GPU and roughly 2× less VRAM at the same accuracy — int8 on GPU lands large-v2 inference in ~59s for a 13-min clip at ~2.9 GB VRAM versus reference Whisper's 2m23s at 4.7 GB. Supports word timestamps, Silero VAD filter, and a BatchedInferencePipeline for parallel chunk decoding.
Best for production batch pipelines, single-GPU servers, anyone hitting OOM on stock Whisper, and CPU-only laptops via int8. No diarization — pair with pyannote, or step up to WhisperX. License: MIT (and CTranslate2 itself is MIT). Latest stable: 1.2.1 on PyPI, Python ≥3.9.
What it is
faster-whisper wraps Whisper in CTranslate2 — a tuned inference engine for transformer models. On a single consumer GPU it's ~4× faster than reference Whisper and uses ~2× less VRAM, with essentially identical accuracy. This is what most production Whisper stacks actually run, including Whipscribe. MIT-licensed and stable.
Watch out for: No diarization (pair with pyannote or whisperX); model conversion step can trip people up the first time.
Install / use
pip install faster-whisper
Pick a model · 6 variants
faster-whisper users almost always pick a model before they pick a deployment. Each card below links the canonical Hugging Face model card — the library fetches the CTranslate2 weights from there on first run, no manual download. VRAM figures are at fp16 unless noted; int8 roughly halves them.
The reference Whisper-large-v3 conversion. Lowest WER in the family on most languages and the default starting point for accuracy-sensitive batch jobs. Drop-in: WhisperModel("large-v3").
~3.1 GB on disk · ~3 GB VRAM at fp16 · ~1.5 GB VRAM at int8 · 99 languages
CT2 conversion of OpenAI's whisper-large-v3-turbo (4 decoder layers instead of 32). On the same GPU you get materially higher throughput at quality close to large-v3 — the sweet spot for high-volume batch transcription.
~1.5 GB on disk · ~2 GB VRAM at fp16 · 100 languages · community CT2 conversion
HuggingFace's distilled Whisper-large-v3 in CT2 form. Roughly 6× faster than large-v3 at near-parity English WER. Trained primarily on English so accuracy on other languages will drop — use one of the larger models if you need multilingual.
~1.5 GB on disk · ~1.6 GB VRAM at fp16 · English-focused (multilingual support inherited but weaker)
Five decoder layers fewer than large, materially smaller VRAM footprint, WER still strong for non-adversarial audio. The default pick on a 6–8 GB consumer GPU when large-v3 won't fit alongside the rest of your stack.
~1.5 GB on disk · ~1.6 GB VRAM at fp16 · ~800 MB at int8 · 99 languages
What faster-whisper's own README benchmarks on CPU. With compute_type="int8" a 13-min clip runs in ~2m37s on CPU vs reference Whisper's 6m58s — usable for desk-side batch work without a GPU.
~480 MB on disk · ~1 GB RAM at int8 on CPU · 99 languages
Smallest CT2 Whisper variant. WER trails the larger models materially but the decode loop is small enough to run usable inference on edge ARM boxes or alongside heavier workloads on shared hardware.
~75 MB on disk · ~400 MB RAM at int8 · 99 languages
WhisperModel(...) and faster-whisper will download + cache on first use. For comparison, see openai/whisper (reference implementation) and WhisperX (faster-whisper plus pyannote diarization and forced alignment).Setup recipes · pick one and copy
Three runnable configurations covering the most common faster-whisper deployments. Each block is copy-and-run against faster-whisper 1.2.x.
pip install · CPU or GPU · 8-line transcription script. The default choice.
# faster-whisper 1.2.x · Python >=3.9
pip install faster-whisper
# transcribe.py
from faster_whisper import WhisperModel
# device="cpu" + compute_type="int8" also works
model = WhisperModel(
"large-v3-turbo", # or large-v3, medium, small, tiny
device="cuda",
compute_type="float16",
)
segments, info = model.transcribe(
"audio.mp3",
beam_size=5,
)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for s in segments:
print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")
CUDA + cuDNN · float16 weights · BatchedInferencePipeline for parallel chunks. This is the production sweet spot.
# CUDA + cuDNN already installed on the box.
# For CUDA 12, cuDNN 9 is bundled via nvidia-cudnn-cu12.
pip install faster-whisper nvidia-cudnn-cu12
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel(
"large-v3-turbo",
device="cuda",
compute_type="float16", # int8_float16 halves VRAM
)
# batched = parallel decode across 30s chunks
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe(
"podcast.wav",
batch_size=16, # 8 on 8 GB, 16 on 16 GB, 32 on 24 GB+
beam_size=5,
)
for s in segments:
print(s.start, s.end, s.text)
word_timestamps=True for force-aligned words · vad_filter=True drops silence before decode (Silero VAD).
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
"meeting.wav",
beam_size=5,
word_timestamps=True,
vad_filter=True,
vad_parameters=dict(
min_silence_duration_ms=500, # drop silences >=500ms
),
)
for s in segments:
print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")
for w in s.words:
print(f" {w.start:6.2f}-{w.end:6.2f} {w.word!r}")
Features
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- SYSTRAN/faster-whisper ↗ ↗main repo · README has the canonical benchmark table and BatchedInferencePipeline docs
- PyPI · faster-whisper 1.2.1 ↗ ↗latest stable, released 2025-10-31 · Python ≥3.9
- huggingface.co/Systran ↗ ↗SYSTRAN-maintained model collection · large-v3, large-v2, medium, small, tiny, base
- CTranslate2 docs ↗ ↗the OpenNMT inference engine faster-whisper is built on · MIT-licensed
- CTranslate2 quantization guide ↗ ↗int8 / int8_float16 / float16 / float32 tradeoffs · what compute_type to pick
- m-bain/whisperX ↗ ↗faster-whisper + pyannote diarization + forced alignment for word-accurate timestamps
- jianfch/stable-ts ↗ ↗Whisper output post-processing for more reliable timestamps and SRT/VTT/ASS export
- snakers4/silero-vad ↗ ↗the VAD model faster-whisper bundles for vad_filter=True
faster-whisper vs Whipscribe
| Feature | faster-whisper | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | Linux, macOS, Windows, GPU | Web, API, MCP |
Alternatives to faster-whisper
Frequently asked about faster-whisper
Is faster-whisper more accurate than OpenAI Whisper?
Accuracy is essentially identical — same model weights, same training. The difference is runtime speed and VRAM usage, both of which faster-whisper improves materially. Any measurable WER gap is in the noise.
How much faster is faster-whisper?
Published benchmarks show roughly 4x faster inference on a single GPU versus the reference openai-whisper package, with ~2x lower VRAM. Your mileage varies with batch size, model size, and hardware.
Does faster-whisper support diarization?
No. It handles transcription only. Combine it with pyannote for speaker labels, or use whisperX, which integrates both.
What license is faster-whisper?
MIT — permissive for commercial and non-commercial use. CTranslate2 (the underlying engine) is also MIT-licensed.
Can faster-whisper run on CPU?
Yes. Use compute_type='int8' and a smaller model (base or small) for acceptable speed. For production CPU workloads, whisper.cpp is usually faster.
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.