distil-whisper

Author: Hugging Face

Hugging Face's distilled Whisper: a 756M-parameter student distilled from whisper-large-v3 that runs about 6× faster while staying within 1% WER of the teacher on English. MIT-licensed and usable as a drop-in replacement through transformers.

TL;DR

Distil-Whisper is a knowledge-distilled Whisper from Hugging Face, trained via large-scale pseudo-labelling as described in arXiv:2311.00430 (Gandhi, von Platen, Rush · 2023). The flagship distil-large-v3 has 756M params, roughly half the size of whisper-large-v3 (1550M), with 6.3× lower latency on long-form audio and WER within 1% of the teacher on out-of-distribution data. The newer distil-large-v3.5 (Aug 2024) was trained on 98k hours, about 4.5× more data than v3, and drops short-form WER to 7.08.
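
The size and data ratios quoted here can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the 1550M parameter count for whisper-large-v3 is an assumption taken from OpenAI's published model sizes rather than from this page, the 22k-hour figure for v3 comes from the checkpoint list further down, and the helper names are hypothetical.

```python
# Parameter counts in millions. 756 is quoted on this page; 1550 for
# whisper-large-v3 is an assumption based on OpenAI's published model sizes.
STUDENT_PARAMS_M = 756
TEACHER_PARAMS_M = 1550

# Training hours: 98k for distil-large-v3.5 and 22k for distil-large-v3.
V35_HOURS = 98_000
V3_HOURS = 22_000


def relative_reduction(student: float, teacher: float) -> float:
    """Fraction by which the student is smaller than the teacher."""
    return 1.0 - student / teacher


def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


if __name__ == "__main__":
    reduction = relative_reduction(STUDENT_PARAMS_M, TEACHER_PARAMS_M)
    print(f"size reduction: {reduction:.1%}")
    print(f"training data ratio: {ratio(V35_HOURS, V3_HOURS):.2f}x")
```

Under these inputs the student is a little over half the teacher's size smaller, and the v3.5 training set is roughly four and a half times the v3 one.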

Best for English-only production workloads where latency, VRAM, and cost dominate the decision: long-form podcasts, batch pipelines, on-device inference. It can also serve as the draft model for speculative decoding with whisper-large-v3, giving a ~2× speedup while producing the same output as whisper-large-v3 alone. Multilingual variants have not shipped yet; for non-English audio use faster-whisper or the teacher. License: MIT; install via pip install transformers.
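
Speculative decoding can be wired up through the assisted-generation interface in transformers. The sketch below assumes the pattern shown on the distil-large-v3 model card: the student is loaded decoder-only via AutoModelForCausalLM and passed as assistant_model in generate_kwargs. The function name build_assisted_pipeline is hypothetical, and the heavy imports sit inside the function so that nothing is loaded or downloaded until it is called.

```python
def build_assisted_pipeline(
    teacher_id: str = "openai/whisper-large-v3",
    assistant_id: str = "distil-whisper/distil-large-v3",
):
    """Return an ASR pipeline in which the distilled model drafts tokens
    and the teacher verifies them (hypothetical helper)."""
    # Imported here so that defining the function does not require torch.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSpeechSeq2Seq,
        AutoProcessor,
        pipeline,
    )

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # Teacher: full encoder-decoder model that decides the final output.
    teacher = AutoModelForSpeechSeq2Seq.from_pretrained(
        teacher_id, torch_dtype=dtype
    ).to(device)
    processor = AutoProcessor.from_pretrained(teacher_id)

    # Assistant: assumed to share the teacher's encoder, so only the
    # decoder weights are loaded.
    assistant = AutoModelForCausalLM.from_pretrained(
        assistant_id, torch_dtype=dtype
    ).to(device)

    return pipeline(
        "automatic-speech-recognition",
        model=teacher,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        generate_kwargs={"assistant_model": assistant},
        torch_dtype=dtype,
        device=device,
    )
```

Calling pipe = build_assisted_pipeline() and then pipe("meeting.wav") should return the teacher's transcript. Assisted generation has generally been limited to batch size 1, so this setup suits latency-bound rather than throughput-bound jobs.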

What it is

A Hugging Face distillation of Whisper that keeps most of the accuracy while cutting inference cost. Best when you know your audio is English and you want to serve many concurrent requests on modest hardware. MIT-licensed.

Best for: English-only workloads where latency and cost matter more than multilingual coverage.
Watch out for: English-only across all checkpoints; long-form WER is close to, but not identical to, Whisper-large-v3.

Install / use

pip install transformers  # then load distil-whisper/distil-large-v3

Pick a checkpoint · 5 variants

Every distil-whisper checkpoint is English-only and MIT-licensed. The model id is what you pass to from_pretrained(...); transformers downloads the safetensors weights on first use, so no manual fetch is needed. WER figures are out-of-distribution Open-ASR averages quoted on each model card; latency figures are relative to the baseline named in each entry.

distil-large-v3.5

Latest · 4.5× more training data than v3

August 2024 refresh: trained on 98k hours (vs v3's 22k) with a 'patient teacher' and SpecAugment. Short-form WER drops to 7.08; long-form to 11.39. Roughly 1.5× faster than whisper-large-v3-turbo and a drop-in replacement for earlier distil checkpoints.

distil-whisper/distil-large-v3.5
756M params · ~1.5 GB on disk (fp16) · ~2 GB VRAM at fp16 · English · short WER 7.08 / long WER 11.39 · ct2/ggml/onnx variants exist

distil-large-v3

The widely deployed flagship, distilled from whisper-large-v3. 756M params, within 1% WER of the teacher on OOD evaluation, and optimised for OpenAI's sequential long-form algorithm. The default choice when you want large-v3-level quality without adopting the newer v3.5 build.

distil-whisper/distil-large-v3
756M params · ~1.5 GB on disk (fp16) · ~2 GB VRAM at fp16 · English · long-form WER 10.8 · ct2/ggml variants available

distil-large-v2

Original release · distilled from whisper-large-v2

The first distil-large checkpoint, distilled from whisper-large-v2. 5.8× faster than its teacher, short-form WER 10.1. Kept for reproducibility and for stacks pinned to large-v2 alignment — prefer v3 or v3.5 for new builds.

distil-whisper/distil-large-v2
756M params · ~1.5 GB on disk · ~2 GB VRAM at fp16 · English · short WER 10.1 / long WER 11.6

distil-medium.en

Mid-size · 394M params, 6.8× faster

The medium-tier student. It has the largest relative speedup in the family, 6.8× faster than large-v2, while keeping short-form WER at 11.1. A natural pick when you have a 6-8 GB GPU and need headroom for diarization or VAD alongside ASR.

distil-whisper/distil-medium.en
394M params · ~790 MB on disk · ~1 GB VRAM at fp16 · English · short WER 11.1 / long WER 12.4

distil-small.en

Smallest · on-device / edge inference

Smallest checkpoint in the family at 166M. Short-form WER 12.1, long-form 12.8 — meaningfully behind the larger students but small enough for laptop CPU, mobile, or shared-GPU deployments. RTFX of 331 on the ESB benchmark.

distil-whisper/distil-small.en
166M params · ~330 MB on disk · ~500 MB RAM at int8 on CPU · English · short WER 12.1 / long WER 12.8

Selecting at runtime: pass the model id straight to AutoModelForSpeechSeq2Seq.from_pretrained(...) or the pipeline("automatic-speech-recognition", model=...) shortcut. For multilingual workloads use openai/whisper or faster-whisper; for CT2 conversions of these distil checkpoints see Systran/faster-distil-whisper-large-v3.
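
Runtime selection can also be driven by a memory budget. The following is a minimal sketch, not part of any library: the Checkpoint record and pick_checkpoint helper are hypothetical, and the figures are copied from the checkpoint list above. Note that the footprint for distil-small.en is the int8 CPU RAM figure while the others are fp16 VRAM, so the comparison is only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    model_id: str
    params_m: int      # parameters, in millions
    memory_gb: float   # approximate footprint quoted on this page
    short_wer: float   # short-form WER quoted on this page


# distil-large-v3 is left out because the list gives no short-form WER for it.
CHECKPOINTS = [
    Checkpoint("distil-whisper/distil-large-v3.5", 756, 2.0, 7.08),
    Checkpoint("distil-whisper/distil-large-v2", 756, 2.0, 10.1),
    Checkpoint("distil-whisper/distil-medium.en", 394, 1.0, 11.1),
    Checkpoint("distil-whisper/distil-small.en", 166, 0.5, 12.1),
]


def pick_checkpoint(memory_budget_gb: float) -> str:
    """Return the lowest-WER checkpoint whose footprint fits the budget."""
    fitting = [c for c in CHECKPOINTS if c.memory_gb <= memory_budget_gb]
    if not fitting:
        raise ValueError(f"no checkpoint fits in {memory_budget_gb} GB")
    return min(fitting, key=lambda c: c.short_wer).model_id
```

The returned id can be passed directly as the model argument, for example pipeline("automatic-speech-recognition", model=pick_checkpoint(1.5)).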

Run distil-whisper · pick a runtime

Three runnable configurations. Recipe 1 is the canonical HF transformers path; recipes 2 and 3 trade transformers for faster-whisper (CTranslate2) and whisper.cpp (GGML) when latency or footprint matters more than parity with the upstream tooling.

1. HF transformers · pipeline API

pip install transformers · pipeline("automatic-speech-recognition", ...). Canonical path from the model card, works on CPU or GPU.

# distil-whisper via Hugging Face transformers >=4.39
# (shell) quote the extra so shells like zsh do not treat the brackets as a glob
pip install --upgrade transformers accelerate "datasets[audio]"

# transcribe.py
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype  = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3.5",  # or distil-large-v3, distil-small.en
    torch_dtype=dtype,
    device=device,
    chunk_length_s=25,   # long-form: chunk into 25s windows
    batch_size=16,       # GPU only; on CPU set batch_size=1
)

result = pipe("meeting.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])

Source: distil-large-v3.5 model card ↗. GitHub: huggingface/distil-whisper ↗

2. faster-whisper · CTranslate2 for production

Swap transformers for faster-whisper and the Systran CT2 conversion: roughly 2× the throughput at the same WER, lower VRAM use, and batched decoding.

# CTranslate2 path · production sweet spot
pip install faster-whisper

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel(
    "Systran/faster-distil-whisper-large-v3",
    device="cuda",
    compute_type="float16",   # int8_float16 halves VRAM
)

# Parallel decode across 30s chunks
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe(
    "podcast.wav",
    batch_size=16,            # 8 on 8 GB, 16 on 16 GB, 32 on 24 GB+
    beam_size=5,
    word_timestamps=True,
)
for s in segments:
    print(f"[{s.start:6.2f} -> {s.end:6.2f}] {s.text}")

Source: Systran/faster-distil-whisper-large-v3 ↗. Engine docs: SYSTRAN/faster-whisper README ↗
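
Throughput claims such as the ~2× figure are easiest to check by measuring the inverse real-time factor (RTFx: seconds of audio divided by seconds of wall-clock time) on your own audio. A minimal sketch; rtfx and timed_transcribe are hypothetical helpers, and the code assumes only that the transcribe callable returns a lazy segments iterator plus an info object with a duration attribute in seconds, as the recipe above suggests.

```python
import time


def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: values above 1 are faster than real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed_transcribe(transcribe, path: str, **kwargs):
    """Run transcribe(path, **kwargs) and return (segments, rtfx)."""
    started = time.perf_counter()
    segments, info = transcribe(path, **kwargs)
    # The segments iterator is assumed to be lazy, so it has to be
    # consumed inside the timed region for the measurement to be meaningful.
    segments = list(segments)
    elapsed = time.perf_counter() - started
    return segments, rtfx(info.duration, elapsed)
```

For example, segments, speed = timed_transcribe(batched.transcribe, "podcast.wav", batch_size=16) measures the batched path; running the same call against a different runtime gives a like-for-like comparison.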

3. whisper.cpp · GGML quantised for laptop / edge

Pure-C inference via whisper.cpp. Quantised GGML weights run distil-large-v3 on a MacBook M-series CPU at near real-time without Python.

# whisper.cpp + distil-large-v3 GGML (laptop / edge)
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
make

# fetch the official GGML conversion (auto-grabs ggml-distil-large-v3.bin)
bash ./models/download-ggml-model.sh distil-large-v3

# transcribe a 16 kHz mono WAV (resample first if needed)
./build/bin/whisper-cli \
  -m models/ggml-distil-large-v3.bin \
  -f samples/meeting.wav \
  -l en \
  --output-srt --output-vtt

# quantise to ~half the disk + RAM with a small WER cost
./build/bin/quantize \
  models/ggml-distil-large-v3.bin \
  models/ggml-distil-large-v3-q5_0.bin q5_0

Model card: distil-whisper/distil-large-v3-ggml ↗. Engine: ggml-org/whisper.cpp ↗
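
Because the whisper-cli step above expects 16 kHz mono input, it can help to check a file before transcribing. A minimal sketch using only Python's standard wave module; is_whisper_cpp_ready is a hypothetical helper, and the 16-bit sample-width check is an assumption about what whisper.cpp accepts rather than something stated on this page.

```python
import wave


def is_whisper_cpp_ready(path: str, sample_rate: int = 16_000) -> bool:
    """True if the file is a PCM WAV that is mono, 16-bit, and at sample_rate."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getnchannels() == 1
                and wav.getsampwidth() == 2  # 2 bytes = 16-bit samples
                and wav.getframerate() == sample_rate
            )
    except (wave.Error, EOFError):
        # Not a WAV file the wave module can parse (e.g. MP3 or float WAV).
        return False
```

Files that fail the check can be converted first with a tool such as ffmpeg, for example ffmpeg -i in.mp3 -ar 16000 -ac 1 -c:a pcm_s16le out.wav.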

Features

Speaker diarization	No
Word-level timestamps	Yes
Streaming / real-time	No
Languages supported	1
HIPAA eligible	No

distil-whisper vs Whipscribe

Feature	distil-whisper	Whipscribe
Category	Open source	Transcription APIs
Pricing	free	free beta
Speaker diarization	No	Yes
Word timestamps	Yes	Yes
Streaming	No	No
Languages	1	99
Platforms	Linux, macOS, GPU, CPU	Web, API, MCP

Alternatives to distil-whisper

OpenAI Whisper

OpenAI

The reference open-source multilingual ASR model from OpenAI.

OSS · MIT ★ 98.1k

whisper.cpp

Georgi Gerganov

C/C++ port of Whisper — runs on anything, from a Raspberry Pi to Apple Silicon.

OSS · MIT ★ 48.8k

faster-whisper

SYSTRAN

4× faster than reference Whisper using CTranslate2 — production sweet spot.

OSS · MIT ★ 22.3k

Whipscribe is a managed faster-whisper + whisperX service, aimed at getting transcripts without running your own infrastructure.