faster-whisper

by SYSTRAN

Production-grade Whisper inference. CTranslate2 rebuild, ~4× faster than reference openai-whisper at the same WER, smaller VRAM, drop-in Python API. Pick a model, copy a recipe.

TL;DR

MIT-licensed reimplementation of OpenAI Whisper on the CTranslate2 engine. Up to ~4× faster on a single GPU, with lower VRAM use at the same accuracy. For example, int8 on GPU runs large-v2 over a 13-min clip in ~59s at ~2.9 GB VRAM, versus reference Whisper's 2m23s at 4.7 GB (about 2.4× faster and 1.6× less memory for that particular run). Supports word timestamps, a Silero VAD filter, and a BatchedInferencePipeline for parallel chunk decoding.

Best for production batch pipelines, single-GPU servers, anyone hitting OOM on stock Whisper, and CPU-only laptops via int8. No diarization — pair with pyannote, or step up to WhisperX. License: MIT (and CTranslate2 itself is MIT). Latest stable: 1.2.1 on PyPI, Python ≥3.9.

Category: Open source
License: MIT
Stars: ★ 22.3k
Last push: 2025-11-19
Pricing: free
Platforms: Linux, macOS, Windows, GPU

What it is

faster-whisper wraps Whisper in CTranslate2 — a tuned inference engine for transformer models. On a single consumer GPU it's ~4× faster than reference Whisper and uses ~2× less VRAM, with essentially identical accuracy. This is what most production Whisper stacks actually run, including Whipscribe. MIT-licensed and stable.

Best for: Production batch transcription on GPU, cheapest $/hour of any hosted Whisper variant.
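Watch out for: No diarization (pair with pyannote or WhisperX); converting a custom or fine-tuned checkpoint to CTranslate2 format can trip people up the first time (the stock models listed below download pre-converted).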

Install / use

pip install faster-whisper

Pick a model · 6 variants

faster-whisper users almost always pick a model before they pick a deployment. Each card below links the canonical Hugging Face model card — the library fetches the CTranslate2 weights from there on first run, no manual download. VRAM figures are at fp16 unless noted; int8 roughly halves them.
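The on-disk and fp16 figures on the cards track a simple rule of thumb: parameter count times bytes per weight. The sketch below is only that arithmetic, assuming the approximate parameter counts from the OpenAI Whisper model table (39M, 244M, 769M, 1550M); `weight_size_gb` is a hypothetical helper, not part of faster-whisper, and it covers weights only, so real VRAM use at inference time will be higher.

```python
# Approximate parameter counts (millions), assumed from the OpenAI Whisper model table.
PARAMS_M = {"tiny": 39, "small": 244, "medium": 769, "large-v3": 1550}

# Bytes stored per weight for the two compute types discussed on this page.
BYTES_PER_PARAM = {"float16": 2, "int8": 1}


def weight_size_gb(model: str, compute_type: str = "float16") -> float:
    """Estimate weight size in GB (weights only: no activations, no KV cache)."""
    return PARAMS_M[model] * 1e6 * BYTES_PER_PARAM[compute_type] / 1e9


for name in PARAMS_M:
    fp16 = weight_size_gb(name, "float16")
    int8 = weight_size_gb(name, "int8")
    print(f"{name:9s} fp16 ~{fp16:.2f} GB   int8 ~{int8:.2f} GB")
```

The fp16 estimates land close to the on-disk sizes quoted on the cards, and the int8 column is exactly half, which is where the "int8 roughly halves them" rule comes from.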

large-v3
Flagship accuracy · 99 languages

The reference Whisper-large-v3 conversion. Lowest WER in the family on most languages and the default starting point for accuracy-sensitive batch jobs. Drop-in: WhisperModel("large-v3").

Systran/faster-whisper-large-v3
~3.1 GB on disk · ~3 GB VRAM at fp16 · ~1.5 GB VRAM at int8 · 99 languages

large-v3-turbo
Throughput · pruned decoder, faster decode

CT2 conversion of OpenAI's whisper-large-v3-turbo (4 decoder layers instead of 32). On the same GPU you get materially higher throughput at quality close to large-v3 — the sweet spot for high-volume batch transcription.

deepdml/faster-whisper-large-v3-turbo-ct2
~1.5 GB on disk · ~2 GB VRAM at fp16 · 100 languages · community CT2 conversion

distil-large-v3
Distilled · English-optimised speed

Hugging Face's distilled Whisper-large-v3 in CT2 form. Roughly 6× faster than large-v3 at near-parity English WER. Trained primarily on English, so accuracy on other languages will drop; use one of the larger models if you need multilingual.

Systran/faster-distil-whisper-large-v3
~1.5 GB on disk · ~1.6 GB VRAM at fp16 · English-focused (multilingual support inherited but weaker)

medium
Sweet spot for low-VRAM GPUs

24 decoder layers versus 32 in large, so a materially smaller VRAM footprint, with WER still strong for non-adversarial audio. The default pick on a 6–8 GB consumer GPU when large-v3 won't fit alongside the rest of your stack.

Systran/faster-whisper-medium
~1.5 GB on disk · ~1.6 GB VRAM at fp16 · ~800 MB at int8 · 99 languages

small
Laptop CPU · int8 compute

What faster-whisper's own README benchmarks on CPU. On a 13-min clip it reports ~2m37s at fp32 and ~1m42s with compute_type="int8", vs reference Whisper's 6m58s, which is usable for desk-side batch work without a GPU.

Systran/faster-whisper-small
~480 MB on disk · ~1 GB RAM at int8 on CPU · 99 languages

tiny
Edge / ARM · fastest decode

Smallest CT2 Whisper variant. WER trails the larger models materially but the decode loop is small enough to run usable inference on edge ARM boxes or alongside heavier workloads on shared hardware.

Systran/faster-whisper-tiny
~75 MB on disk · ~400 MB RAM at int8 · 99 languages

Selecting at runtime: pass the HF model id as the first argument to WhisperModel(...) and faster-whisper will download + cache on first use. For comparison, see openai/whisper (reference implementation) and WhisperX (faster-whisper plus pyannote diarization and forced alignment).
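One way to turn the cards into a runtime choice is a small lookup that returns the model id to pass to WhisperModel(...). This is a minimal sketch, not library behaviour: `pick_model`, the `CARDS` table, and the 2× headroom factor (room for activations and batching on top of the weights) are all assumptions for illustration, with the VRAM figures copied from the cards above.

```python
from typing import Optional

# (HF model id, approx fp16 VRAM in GB, multilingual?) taken from the cards above,
# ordered from most to least accurate. Hypothetical table, not shipped by the library.
CARDS = [
    ("Systran/faster-whisper-large-v3", 3.0, True),
    ("deepdml/faster-whisper-large-v3-turbo-ct2", 2.0, True),
    ("Systran/faster-distil-whisper-large-v3", 1.6, False),
    ("Systran/faster-whisper-medium", 1.6, True),
]


def pick_model(free_vram_gb: float, multilingual: bool = True,
               headroom: float = 2.0) -> Optional[str]:
    """Return the first (most accurate) model whose fp16 footprint fits with headroom."""
    for model_id, vram_gb, is_multilingual in CARDS:
        if multilingual and not is_multilingual:
            continue  # skip English-focused models when multilingual is required
        if vram_gb * headroom <= free_vram_gb:
            return model_id
    return None  # nothing fits: consider small or tiny with compute_type="int8"


print(pick_model(8.0))                      # large-v3 fits on an 8 GB card
print(pick_model(3.5, multilingual=False))  # English-only jobs can take distil-large-v3
```

The returned string would then go straight into WhisperModel(...) as its first argument; a `None` result is the cue to drop to the CPU-oriented small or tiny cards.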

Setup recipes · pick one and copy

Three runnable configurations covering the most common faster-whisper deployments. Each block is copy-and-run against faster-whisper 1.2.x.

1. Install + first transcript (Python)

pip install · CPU or GPU · short transcription script. The default choice.

# faster-whisper 1.2.x · Python >=3.9
pip install faster-whisper

# transcribe.py
from faster_whisper import WhisperModel

# device="cpu" + compute_type="int8" also works
model = WhisperModel(
    "large-v3-turbo",      # or large-v3, medium, small, tiny
    device="cuda",
    compute_type="float16",
)

segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for s in segments:
    print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")
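The printed lines above are usually the first step toward a subtitle file. As a hedged sketch, the helper below renders segments as SRT cues; it only relies on the `.start`, `.end`, and `.text` attributes used in the recipe, and the `Segment` namedtuple is a stand-in so the example runs without a model. `srt_timestamp` and `to_srt` are hypothetical names, not part of faster-whisper.

```python
from collections import namedtuple

# Stand-in for faster-whisper segments: only .start, .end and .text are used.
Segment = namedtuple("Segment", "start end text")


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render an iterable of segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


demo = [Segment(0.0, 2.5, " Hello there."), Segment(2.5, 65.04, " Second line.")]
print(to_srt(demo))
```

Because `to_srt` accepts any iterable, the `segments` value returned by model.transcribe(...) can be passed in directly; transcription then runs as the helper iterates.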

2. GPU + batched decode (production)

CUDA + cuDNN · float16 weights · BatchedInferencePipeline for parallel chunks. This is the production sweet spot.

# Needs the CUDA 12 cuBLAS and cuDNN 9 libraries on the box.
# If cuDNN is not installed system-wide, the nvidia-cudnn-cu12 wheel supplies it.
pip install faster-whisper nvidia-cudnn-cu12

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel(
    "large-v3-turbo",
    device="cuda",
    compute_type="float16",   # int8_float16 halves VRAM
)

# batched = parallel decode across 30s chunks
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe(
    "podcast.wav",
    batch_size=16,            # 8 on 8 GB, 16 on 16 GB, 32 on 24 GB+
    beam_size=5,
)
for s in segments:
    print(s.start, s.end, s.text)
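To see why batch_size matters, it can help to count forward passes. The sketch below encodes the recipe's "8 on 8 GB, 16 on 16 GB, 32 on 24 GB+" comment and assumes fixed 30-second chunks; the real pipeline may cut chunks at speech boundaries, so actual counts can differ. `batch_size_for` and `decode_passes` are hypothetical helpers for illustration.

```python
import math


def batch_size_for(vram_gb: float) -> int:
    """Map GPU memory to the batch sizes suggested in the recipe comment."""
    if vram_gb >= 24:
        return 32
    if vram_gb >= 16:
        return 16
    return 8  # the recipe's smallest suggestion; lower it if you hit OOM


def decode_passes(audio_seconds: float, batch_size: int,
                  chunk_seconds: float = 30.0) -> int:
    """Number of batched forward passes, assuming fixed-length chunks."""
    chunks = math.ceil(audio_seconds / chunk_seconds)
    return math.ceil(chunks / batch_size)


clip = 13 * 60  # a 13-minute clip, i.e. 26 chunks of 30 s
for vram in (8, 16, 24):
    size = batch_size_for(vram)
    print(f"{vram} GB -> batch_size={size}, passes={decode_passes(clip, size)}")
```

Under these assumptions a 13-minute clip needs 4 passes at batch_size=8, 2 at 16, and 1 at 32, versus 26 sequential chunk decodes without batching.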

3. Word timestamps + VAD filter

word_timestamps=True for per-word start/end times (derived from the model's cross-attention alignment, not external forced alignment) · vad_filter=True drops silence before decode (Silero VAD).

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=500,   # drop silences >=500ms
    ),
)

for s in segments:
    print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")
    for w in s.words:
        print(f"  {w.start:6.2f}-{w.end:6.2f}  {w.word!r}")

VAD uses the bundled Silero VAD. For speaker labels on top of words, use WhisperX or pyannote.
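The effect of min_silence_duration_ms can be pictured with a toy model: pauses shorter than the threshold stay inside a speech region, and only pauses at or above it split the audio. The sketch below is an illustration under that assumption, not Silero VAD's implementation (which works on per-frame speech probabilities); `merge_speech` is a hypothetical helper.

```python
def merge_speech(intervals, min_silence_ms=500):
    """Merge (start, end) speech intervals, in seconds, whose gap is below the threshold."""
    merged = []
    for start, end in sorted(intervals):
        gap_ms = (start - merged[-1][1]) * 1000 if merged else None
        if merged and gap_ms < min_silence_ms:
            # Short pause: treat it as part of the same speech region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # First interval, or a pause long enough to count as silence.
            merged.append((start, end))
    return merged


speech = [(0.0, 2.0), (2.2, 4.0), (5.0, 7.5)]
print(merge_speech(speech))  # the 0.2 s pause is absorbed, the 1.0 s pause splits
```

Raising the threshold keeps more short pauses inside a segment; lowering it cuts more aggressively, which can clip natural hesitations.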

Features

Speaker diarization: No
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 99
HIPAA eligible: No

faster-whisper vs Whipscribe

Feature | faster-whisper | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | Yes | Yes
Streaming | No | No
Languages | 99 | 99
Platforms | Linux, macOS, Windows, GPU | Web, API, MCP

Frequently asked about faster-whisper

Is faster-whisper more accurate than OpenAI Whisper?

Accuracy is essentially identical — same model weights, same training. The difference is runtime speed and VRAM usage, both of which faster-whisper improves materially. Any measurable WER gap is in the noise.

How much faster is faster-whisper?

The project reports up to 4x faster inference on a single GPU versus the reference openai-whisper package, with ~2x lower VRAM. Actual gains vary with batch size, model size, compute type, and hardware.
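To check the claim on your own hardware, time both implementations on the same clip and compare. As a worked example, the arithmetic below uses the figures quoted in the TL;DR (large-v2, 13-min clip, 59s and 2.9 GB for faster-whisper int8 versus 2m23s and 4.7 GB for reference Whisper); the clip is assumed to be exactly 780 seconds, and `to_seconds` is just a local helper.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


clip = to_seconds(13, 0)          # audio length, assumed exactly 13 minutes
reference = to_seconds(2, 23)     # reference Whisper wall-clock time
faster_int8 = to_seconds(0, 59)   # faster-whisper int8 wall-clock time

speedup = reference / faster_int8        # how many times faster
realtime_factor = clip / faster_int8     # seconds of audio per second of compute
vram_ratio = 4.7 / 2.9                   # reference VRAM over faster-whisper VRAM

print(f"speedup: {speedup:.2f}x")
print(f"real-time factor: {realtime_factor:.1f}x")
print(f"VRAM ratio: {vram_ratio:.2f}x")
```

For this un-batched int8 run the speedup works out to roughly 2.4x, so the headline multiple presumably reflects other configurations such as batched decoding; measuring your own workload is the only reliable number.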

Does faster-whisper support diarization?

No. It handles transcription only. Combine it with pyannote for speaker labels, or use WhisperX, which integrates both.
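Pairing with a diarizer usually comes down to matching word timestamps against speaker turns. The sketch below shows one simple approach, assigning each word to the turn it overlaps most; the `(start, end, speaker)` tuples are a hypothetical stand-in for diarizer output rather than pyannote's actual API, and `label_words` is an illustrative helper, not how WhisperX necessarily does it.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Assign each (start, end, text) word to the speaker turn it overlaps most."""
    labeled = []
    for w_start, w_end, text in words:
        best = max(
            turns,
            key=lambda turn: overlap(w_start, w_end, turn[0], turn[1]),
            default=None,
        )
        if best is None or overlap(w_start, w_end, best[0], best[1]) == 0.0:
            speaker = "UNKNOWN"  # word falls outside every diarized turn
        else:
            speaker = best[2]
        labeled.append((speaker, text))
    return labeled


words = [(0.0, 0.4, "Hi"), (0.5, 0.9, "there"), (1.6, 2.0, "Hello")]
turns = [(0.0, 1.2, "SPEAKER_00"), (1.5, 3.0, "SPEAKER_01")]
print(label_words(words, turns))
```

The word tuples map directly onto the `.start`, `.end`, and `.word` fields produced with word_timestamps=True.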

What license is faster-whisper?

MIT — permissive for commercial and non-commercial use. CTranslate2 (the underlying engine) is also MIT-licensed.

Can faster-whisper run on CPU?

Yes. Use compute_type='int8' and a smaller model (base or small) for acceptable speed. For production CPU workloads, whisper.cpp is usually faster.

Whipscribe is a managed faster-whisper + WhisperX service. If you want transcripts without running infrastructure, paste a URL or upload a file and you'll have a transcript in seconds.