distil-whisper

by Hugging Face

Hugging Face's distilled Whisper. Knowledge-distilled from large-v3 into a 756M-param student that runs ~6× faster while staying within 1% WER on English. MIT-licensed, drop-in via transformers.

TL;DR

Distil-Whisper is a knowledge-distilled Whisper from Hugging Face, trained via large-scale pseudo-labelling as described in arXiv:2311.00430 (Gandhi, von Platen, Rush · 2023). The flagship distil-large-v3 has 756M params: 49% smaller than whisper-large-v3, with 6.3× lower latency on long-form audio and within 1% WER of the teacher on out-of-distribution data. The newer distil-large-v3.5 (Aug 2024) was trained on 98k hours, 4.5× more data than v3, and drops short-form WER to 7.08.

Best for English-only production workloads where latency, VRAM, and cost dominate the decision: long-form podcasts, batch pipelines, on-device inference. Used as the draft (assistant) model in speculative decoding with whisper-large-v3, it gives a ~2× speedup with outputs identical to the teacher's. Multilingual variants are not yet shipped; for non-English audio use faster-whisper or the teacher. License: MIT; install via pip install transformers.
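
The speculative-decoding pairing mentioned above can be sketched with transformers' assisted generation. This is a minimal sketch, not a definitive setup: it assumes the `assistant_model` generation argument and the two model ids shown, the function name `build_speculative_pipeline` is hypothetical, and nothing is downloaded until the function is actually called.

```python
TEACHER_ID = "openai/whisper-large-v3"            # verifies the drafted tokens
ASSISTANT_ID = "distil-whisper/distil-large-v3"   # drafts tokens cheaply


def build_speculative_pipeline(device=None):
    """Hypothetical helper: ASR pipeline where distil-whisper drafts and the teacher verifies."""
    # Heavy imports live inside the function so that importing this file stays cheap.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSpeechSeq2Seq,
        AutoProcessor,
        pipeline,
    )

    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device.startswith("cuda") else torch.float32

    teacher = AutoModelForSpeechSeq2Seq.from_pretrained(TEACHER_ID, torch_dtype=dtype).to(device)
    # Assumption: the assistant is loaded decoder-only, reusing the teacher's encoder output.
    assistant = AutoModelForCausalLM.from_pretrained(ASSISTANT_ID, torch_dtype=dtype).to(device)
    processor = AutoProcessor.from_pretrained(TEACHER_ID)

    return pipeline(
        "automatic-speech-recognition",
        model=teacher,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        generate_kwargs={"assistant_model": assistant},
        torch_dtype=dtype,
        device=device,
    )
```

Calling `build_speculative_pipeline()("meeting.wav")` would then transcribe with the teacher's output while the smaller model proposes candidate tokens; the file name is only a placeholder.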

Category: Open source
License: MIT
Stars: ★ 4.1k
Last push: 2025-01-08
Pricing: free
Platforms: Linux, macOS, GPU, CPU

What it is

A Hugging Face distillation of Whisper that keeps most of the accuracy while cutting inference cost. Best when you know your audio is English and you want to serve many concurrent requests on modest hardware. MIT-licensed.

Best for: English-only workloads where latency and cost matter more than multilingual coverage.
Watch out for: English-only in v3; slightly higher WER (lower accuracy) than Whisper-large-v3 on long-form.

Install / use

pip install transformers  # then load distil-whisper/distil-large-v3

Pick a checkpoint · 5 variants

Every distil-whisper checkpoint is English-only and MIT-licensed. The model id is what you pass to from_pretrained(...) — transformers downloads the safetensors weights on first use, no manual fetch. WER columns are out-of-distribution Open-ASR averages quoted on each model card; latency is relative to whisper-large-v3.

distil-large-v3.5
Latest · 4.5× more training data than v3

August 2024 refresh: trained on 98k hours (vs v3's 22k) with a 'patient teacher' and SpecAugment. Short-form WER drops to 7.08; long-form to 11.39. Roughly 1.5× faster than whisper-large-v3-turbo and a drop-in replacement for earlier distil checkpoints.

distil-whisper/distil-large-v3.5
756M params · ~1.5 GB on disk (fp16) · ~2 GB VRAM at fp16 · English · short WER 7.08 / long WER 11.39 · ct2/ggml/onnx variants exist

distil-large-v3
Recommended default · 6.3× faster than teacher

The widely-deployed flagship distilled from whisper-large-v3. 756M params, within 1% WER on OOD evaluation, and optimised for OpenAI's sequential long-form algorithm. The default choice when you want the v3-quality benefit without v3.5's newer-build risk.

distil-whisper/distil-large-v3
756M params · ~1.5 GB on disk (fp16) · ~2 GB VRAM at fp16 · English · long-form WER 10.8 · ct2/ggml variants available

distil-large-v2
Original release · distilled from whisper-large-v2

The first distil-large checkpoint, distilled from whisper-large-v2. 5.8× faster than its teacher, short-form WER 10.1. Kept for reproducibility and for stacks pinned to large-v2 alignment — prefer v3 or v3.5 for new builds.

distil-whisper/distil-large-v2
756M params · ~1.5 GB on disk · ~2 GB VRAM at fp16 · English · short WER 10.1 / long WER 11.6

distil-medium.en
Mid-size · 394M params, 6.8× faster than large-v2

The medium-tier student. It has the largest relative speedup in the family, 6.8× over large-v2, while keeping short-form WER at 11.1. A natural pick when you have a 6-8 GB GPU and need headroom for diarization or VAD alongside ASR.

distil-whisper/distil-medium.en
394M params · ~790 MB on disk · ~1 GB VRAM at fp16 · English · short WER 11.1 / long WER 12.4

distil-small.en
Smallest · on-device / edge inference

Smallest checkpoint in the family at 166M. Short-form WER 12.1, long-form 12.8 — meaningfully behind the larger students but small enough for laptop CPU, mobile, or shared-GPU deployments. RTFX of 331 on the ESB benchmark.

distil-whisper/distil-small.en
166M params · ~330 MB on disk · ~500 MB RAM at int8 on CPU · English · short WER 12.1 / long WER 12.8

Selecting at runtime: pass the model id straight to AutoModelForSpeechSeq2Seq.from_pretrained(...) or the pipeline("automatic-speech-recognition", model=...) shortcut. For multilingual workloads use openai/whisper or faster-whisper; for CT2 conversions of these distil checkpoints see Systran/faster-distil-whisper-large-v3.
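
The on-disk figures listed for each checkpoint follow from two bytes per parameter at fp16. The sketch below reproduces that arithmetic and picks the largest checkpoint whose weights fit a size budget. The `CHECKPOINTS` table copies the parameter counts listed above; `fp16_size_mb` and `largest_that_fits` are hypothetical helper names, and the budget covers weights only, not activations or runtime overhead.

```python
# Parameter counts in millions, as listed for each checkpoint above.
CHECKPOINTS = {
    "distil-whisper/distil-large-v3.5": 756,
    "distil-whisper/distil-large-v3": 756,
    "distil-whisper/distil-large-v2": 756,
    "distil-whisper/distil-medium.en": 394,
    "distil-whisper/distil-small.en": 166,
}


def fp16_size_mb(params_millions):
    """fp16 stores 2 bytes per parameter, so N million params is about 2*N MB."""
    return params_millions * 2


def largest_that_fits(budget_mb):
    """Return the model id with the most params whose fp16 weights fit budget_mb.

    Ties go to the checkpoint listed first; returns None if nothing fits.
    """
    fitting = [(params, model_id) for model_id, params in CHECKPOINTS.items()
               if fp16_size_mb(params) <= budget_mb]
    if not fitting:
        return None
    best = max(params for params, _ in fitting)
    return next(model_id for params, model_id in fitting if params == best)


print(fp16_size_mb(756))        # 1512 MB, i.e. the ~1.5 GB quoted above
print(largest_that_fits(1000))  # distil-whisper/distil-medium.en
```

The returned id can be passed directly to from_pretrained(...) or pipeline(..., model=...).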

Run distil-whisper · pick a runtime

Three runnable configurations. Recipe 1 is the canonical HF transformers path; recipes 2 and 3 trade transformers for faster-whisper (CTranslate2) and whisper.cpp (GGML) when latency or footprint matters more than parity with the upstream tooling.

1. HF transformers · pipeline API

pip install transformers · pipeline("automatic-speech-recognition", ...). Canonical path from the model card, works on CPU or GPU.

# shell: distil-whisper via Hugging Face transformers >=4.39
# (the extras spec is quoted so shells like zsh do not expand the brackets)
pip install --upgrade transformers accelerate "datasets[audio]"

# transcribe.py
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype  = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3.5",  # or distil-large-v3, distil-small.en
    torch_dtype=dtype,
    device=device,
    chunk_length_s=25,   # long-form: chunk into 25s windows
    batch_size=16,       # GPU only; on CPU set batch_size=1
)

result = pipe("meeting.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
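
Each entry in the `chunks` list printed above carries a `(start, end)` timestamp in seconds and a text span. A minimal sketch of turning that structure into SRT-style cues follows; `fmt_time` and `chunks_to_srt` are hypothetical helpers, and the `tail_s` fallback for a missing end time is an assumption about how you might handle an open-ended final chunk.

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, tail_s=2.0):
    """Build SRT text from a list of {"timestamp": (start, end), "text": str} dicts."""
    cues = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: give an open-ended chunk a fixed display duration.
            end = start + tail_s
        cues.append(
            f"{index}\n{fmt_time(start)} --> {fmt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(cues)


# Hand-written sample in the same shape as result["chunks"].
sample = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " Bye."},
]
print(chunks_to_srt(sample))
```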

2. faster-whisper · CTranslate2 for production

Swap transformers for faster-whisper plus the Systran CT2 conversion: ~2× the throughput at the same WER, a lower VRAM footprint, and batched decoding.

# CTranslate2 path · production sweet spot
pip install faster-whisper

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel(
    "Systran/faster-distil-whisper-large-v3",
    device="cuda",
    compute_type="float16",   # int8_float16 halves VRAM
)

# Parallel decode across 30s chunks
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe(
    "podcast.wav",
    batch_size=16,            # 8 on 8 GB, 16 on 16 GB, 32 on 24 GB+
    beam_size=5,
    word_timestamps=True,
)
for s in segments:
    print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")
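
The batch-size comment in the recipe above (8 on 8 GB, 16 on 16 GB, 32 on 24 GB+) can be expressed as a small lookup. This is a sketch only: `pick_batch_size` is a hypothetical helper, the thresholds are the ones in the comment, and the fallback of 4 below 8 GB is an assumption rather than a measured value.

```python
def pick_batch_size(vram_gb):
    """Map available GPU memory (GB) to a batch_size for batched transcription."""
    if vram_gb >= 24:
        return 32
    if vram_gb >= 16:
        return 16
    if vram_gb >= 8:
        return 8
    return 4  # assumption: conservative fallback for smaller GPUs


for gb in (6, 8, 16, 24):
    print(gb, "GB ->", pick_batch_size(gb))
```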

3. whisper.cpp · GGML quantised for laptop / edge

Plain C/C++ inference via whisper.cpp. Quantised GGML weights run distil-large-v3 on a MacBook M-series CPU at near real-time, with no Python required.

# whisper.cpp + distil-large-v3 GGML (laptop / edge)
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
make

# fetch the official GGML conversion (auto-grabs ggml-distil-large-v3.bin)
bash ./models/download-ggml-model.sh distil-large-v3

# transcribe a 16kHz mono wav (resample first if needed)
./build/bin/whisper-cli \
  -m models/ggml-distil-large-v3.bin \
  -f samples/meeting.wav \
  -l en \
  --output-srt --output-vtt

# quantise to ~half the disk + RAM with a small WER cost
./build/bin/quantize \
  models/ggml-distil-large-v3.bin \
  models/ggml-distil-large-v3-q5_0.bin q5_0
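
The 16 kHz mono requirement noted in the recipe above can be checked before invoking whisper-cli. A minimal sketch using Python's standard `wave` module follows; `is_whisper_cpp_ready` is a hypothetical helper, and the 16-bit sample-width check is an assumption based on whisper.cpp expecting 16-bit WAV input.

```python
import wave


def is_whisper_cpp_ready(path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit
        )
```

If the check returns False, resample the file first (for example with ffmpeg) before passing it to `-f`.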

Features

Speaker diarization: No
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 1
HIPAA eligible: No

distil-whisper vs Whipscribe

Feature              distil-whisper           Whipscribe
Category             Open source              Transcription APIs
Pricing              free                     free beta
Speaker diarization  No                       Yes
Word timestamps      Yes                      Yes
Streaming            No                       No
Languages            1                        99
Platforms            Linux, macOS, GPU, CPU   Web, API, MCP

Alternatives to distil-whisper

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.