whisperX
Whisper plus word-aligned timestamps plus speaker labels in a single Python CLI. Max Bain's pipeline — faster-whisper for ASR, wav2vec2 forced alignment, pyannote diarization. Pick a recipe.
WhisperX is a three-stage pipeline glued into one CLI and Python API. Stage 1 transcribes with faster-whisper (CTranslate2-backed Whisper, batched decode). Stage 2 runs wav2vec2 forced alignment so every word lands on a real audio boundary, not Whisper's drifted segment-level timestamps. Stage 3 calls pyannote.audio 3.1 for speaker diarization and assigns each aligned word to a speaker label.
Best wherever who-said-what-when matters: podcast post-production, interview transcripts, meeting recordings, captioning workflows that need karaoke-style word highlighting, and research datasets with speaker turns. License: BSD-2-Clause on WhisperX itself; the pyannote models it downloads are gated on Hugging Face and have their own terms, so read those before commercial deployment. Latest: v3.8.5, paper at arXiv:2303.00747 (INTERSPEECH 2023).
What it is
whisperX combines faster-whisper with forced alignment (wav2vec2) for word-accurate timestamps and pyannote for speaker diarization. If your audio has more than one speaker and you care about proper "Speaker 1 / Speaker 2" labeling, this is the open-source default. BSD-2 licensed.
Watch out for: Requires a HuggingFace token to download pyannote diarization models (gated); heavier first-run setup.
Install / use
pip install whisperx
Use cases · pick the workflow that matches
WhisperX earns its place when alignment and diarization both matter. These are the workflows the README, Discussions, and the paper authors call out as primary fits — each card links the relevant section so you can jump straight to the right recipe.
Two-host or guest interviews where you need accurate speaker labels for show notes, transcript pages, and chapter markers. WhisperX gives you per-word speaker attribution out of the box — drop it into a post-prod script and you get a publishable transcript without manual labeling.
Pair --diarize with --min_speakers and --max_speakers when you already know how many voices to expect.
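A post-production script along these lines could collapse consecutive segments from the same speaker into turns before publishing. This is a minimal sketch, assuming segments shaped like WhisperX's diarized output (dicts with start, end, text, and an optional speaker key). The `merge_turns` helper and the sample data are illustrative and not part of WhisperX.

```python
from typing import Dict, List


def merge_turns(segments: List[Dict]) -> List[Dict]:
    """Collapse consecutive segments by the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # some segments may lack a label
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


# Hypothetical sample data in the shape of result["segments"]
sample = [
    {"start": 0.0, "end": 2.1, "text": " Welcome to the show.", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 4.0, "text": " Today we have a guest.", "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 5.5, "text": " Thanks for having me.", "speaker": "SPEAKER_01"},
]

turns = merge_turns(sample)
for turn in turns:
    print(f"{turn['speaker']} [{turn['start']:.1f}-{turn['end']:.1f}]: {turn['text']}")
```

Mapping the generic labels to host and guest names is left to the script, since diarization only separates voices and does not identify them.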
Field interviews, sales calls, customer-research sessions, internal meetings: all typically recorded on one channel with no per-speaker isolation. WhisperX runs VAD-based segmentation before transcription, then pyannote 3.1 for clustering, which holds up reasonably on 2–6 speakers without per-mic separation.
VAD cuts long silences; batched faster-whisper decodes the chunks; pyannote labels them. Set HF_TOKEN once for diarization.
Word-by-word karaoke captioning needs every word boundary to match the audio. Whisper's native segment-level timestamps are too coarse and can drift by hundreds of milliseconds. The wav2vec2 alignment pass fixes this, and --highlight_words emits an SRT in which the word being spoken is underlined in each cue.
Output formats: srt, vtt, txt, tsv, json, aud. Use --highlight_words True for karaoke SRTs.
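To make the idea of a word-highlighted SRT concrete, the sketch below builds one cue per word and marks the active word. It assumes word dicts with word, start, and end keys, and it assumes `<u>` underline tags as the highlight markup. The helper names are hypothetical, and this is not WhisperX's own subtitle writer.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def karaoke_cues(words: List[Dict]) -> str:
    """Emit one SRT cue per word, underlining the word being spoken."""
    cues = []
    for i, current in enumerate(words):
        line = " ".join(
            f"<u>{w['word']}</u>" if j == i else w["word"]
            for j, w in enumerate(words)
        )
        cues.append(
            f"{i + 1}\n"
            f"{srt_time(current['start'])} --> {srt_time(current['end'])}\n"
            f"{line}\n"
        )
    return "\n".join(cues)


# Hypothetical aligned words for a two-word line
words = [
    {"word": "Hello", "start": 0.50, "end": 0.82},
    {"word": "world", "start": 0.90, "end": 1.31},
]
srt = karaoke_cues(words)
print(srt)
```

Each cue repeats the full line, so the highlight only tracks the audio if the per-word boundaries are accurate, which is what the alignment pass is for.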
Word-accurate timestamps are the input format text-based video editors expect. Export WhisperX's per-word JSON, then feed it into Premiere's caption track or DaVinci Resolve's subtitle import. Speaker labels survive the round-trip so editors can color-code talking heads on the timeline.
JSON output preserves word-level start/end + speaker fields. tsv is the easiest to inspect by hand.
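For a quick look at the per-word data, a script can flatten the JSON into tab-separated rows. The sketch below assumes a top-level segments list whose entries hold a words list with word, start, end, and speaker keys. The inline sample stands in for a real export file, and `word_rows` is a hypothetical helper.

```python
import json
from typing import Iterator, Tuple

# Hypothetical excerpt in the shape of a diarized WhisperX JSON export
raw = """
{"segments": [
  {"start": 0.5, "end": 1.3, "text": " Hello world", "speaker": "SPEAKER_00",
   "words": [
     {"word": "Hello", "start": 0.5, "end": 0.82, "speaker": "SPEAKER_00"},
     {"word": "world", "start": 0.9, "end": 1.31, "speaker": "SPEAKER_00"}
   ]},
  {"start": 1.6, "end": 2.0, "text": " Hi", "speaker": "SPEAKER_01",
   "words": [{"word": "Hi", "start": 1.6, "end": 2.0, "speaker": "SPEAKER_01"}]}
]}
"""


def word_rows(doc: dict) -> Iterator[Tuple[float, float, str, str]]:
    """Yield (start, end, speaker, word) for every timed word."""
    for seg in doc.get("segments", []):
        for w in seg.get("words", []):
            if "start" not in w:  # skip words that carry no timing
                continue
            # Fall back to the segment's speaker when the word has none
            speaker = w.get("speaker", seg.get("speaker", "UNKNOWN"))
            yield w["start"], w["end"], speaker, w["word"]


rows = list(word_rows(json.loads(raw)))
for start, end, speaker, word in rows:
    print(f"{start:.2f}\t{end:.2f}\t{speaker}\t{word}")
```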
Building speech datasets, training a downstream model, or running linguistics analysis on conversational corpora. WhisperX is the closest open-source pipeline to a one-shot dataset annotator — Whisper transcript, wav2vec2 alignment for word boundaries, pyannote turns. Cite arXiv:2303.00747.
Per-language --align_model override: WAV2VEC2_ASR_LARGE_LV60K_960H for English, language-specific checkpoints for non-English.
Throughput mode on a single GPU. The README's own benchmarks show ~70× realtime on large-v2 with batched faster-whisper and under 8 GB of VRAM at beam_size=5. Pass --compute_type int8 for roughly half the VRAM at a small WER cost, useful on 6–8 GB consumer cards.
Tune --batch_size to fit VRAM (4 on 6 GB, 8 on 8 GB, 16+ on 16 GB). --compute_type int8 for laptops.
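The rule of thumb above can be written as a small helper for scripts that run on mixed hardware. The thresholds mirror the tip. The function name and the choice to switch to int8 below 8 GB are assumptions made for this sketch, so treat the values as starting points and not as measured limits.

```python
from typing import Tuple


def suggest_settings(vram_gb: float) -> Tuple[int, str]:
    """Return (batch_size, compute_type) for a given amount of GPU memory."""
    if vram_gb >= 16:
        return 16, "float16"
    if vram_gb >= 8:
        return 8, "float16"
    # Smaller cards: smallest batch from the tip, int8 to cut memory use
    return 4, "int8"


for vram in (6, 8, 24):
    batch_size, compute_type = suggest_settings(vram)
    print(f"{vram} GB -> --batch_size {batch_size} --compute_type {compute_type}")
```

If a run still hits an out-of-memory error, lowering the batch size further is the first thing to try.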
Setup recipes · pick one and copy
Three runnable configurations covering the most common WhisperX deployments. All blocks target whisperx 3.8.x with CUDA 12. Replace ${HF_TOKEN} with your own Hugging Face access token from hf.co/settings/tokens.
pip install · single command CLI · SRT subtitles in one shot. The fastest way to verify the install works.
# whisperx 3.8.x · Python >=3.9 · CUDA 12 for GPU
pip install whisperx

# transcribe a file, write subtitles
whisperx input.mp3 \
  --model large-v3 \
  --compute_type float16 \
  --batch_size 16 \
  --output_format srt

# CPU-only laptop fallback
whisperx input.mp3 \
  --model medium \
  --device cpu \
  --compute_type int8 \
  --batch_size 4
Accept the pyannote model terms on Hugging Face first, then pass --diarize with your HF_TOKEN. Outputs per-word speaker labels.
# 1. Sign in at https://hf.co and accept the user conditions on:
#      https://huggingface.co/pyannote/segmentation-3.0
#      https://huggingface.co/pyannote/speaker-diarization-3.1
# 2. Create a read token at https://hf.co/settings/tokens
# 3. Export it (never hardcode):
export HF_TOKEN="hf_...your_token_here..."

# 4. Transcribe + align + diarize in one pass
whisperx input.mp3 \
  --model large-v3 \
  --diarize \
  --hf_token "${HF_TOKEN}" \
  --min_speakers 2 \
  --max_speakers 4 \
  --output_format srt

# karaoke-style word highlighting in the SRT
whisperx input.mp3 --model large-v3 --diarize \
  --hf_token "${HF_TOKEN}" --highlight_words True --output_format srt
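When the CLI is driven from a Python batch job, a helper that assembles the argument list keeps the token out of the source. The sketch below uses only the flags shown in the recipe above. `build_command` is a hypothetical helper that reads HF_TOKEN from the environment and refuses to request diarization without it.

```python
import os
from typing import List, Optional


def build_command(audio: str, diarize: bool = False,
                  min_speakers: Optional[int] = None,
                  max_speakers: Optional[int] = None,
                  output_format: str = "srt") -> List[str]:
    """Assemble a whisperx CLI invocation as an argv list."""
    cmd = ["whisperx", audio, "--model", "large-v3",
           "--output_format", output_format]
    if diarize:
        token = os.environ.get("HF_TOKEN")
        if not token:
            raise RuntimeError("Set HF_TOKEN before requesting diarization")
        cmd += ["--diarize", "--hf_token", token]
        if min_speakers is not None:
            cmd += ["--min_speakers", str(min_speakers)]
        if max_speakers is not None:
            cmd += ["--max_speakers", str(max_speakers)]
    return cmd


# Placeholder so the demo runs; a real job would export HF_TOKEN beforehand
os.environ.setdefault("HF_TOKEN", "hf_placeholder")
cmd = build_command("input.mp3", diarize=True, min_speakers=2, max_speakers=4)

# Print with the token masked so it never lands in logs
print(" ".join("<HF_TOKEN>" if arg == os.environ["HF_TOKEN"] else arg for arg in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues with file names that contain spaces.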
Three-stage call: load_model → load_align_model + align → DiarizationPipeline + assign_word_speakers. The pattern from the README.
import os

import whisperx
from whisperx.diarize import DiarizationPipeline

device = "cuda"
batch_size = 16
compute_type = "float16"  # "int8" on low-VRAM GPUs
audio_file = "input.mp3"
HF_TOKEN = os.environ["HF_TOKEN"]  # never hardcode

# 1. Transcribe (faster-whisper under the hood)
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Forced alignment (wav2vec2)
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device,
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 3. Diarize + assign speakers to each word
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Print one line per segment with its speaker label
for seg in result["segments"]:
    speaker = seg.get("speaker", "?")
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {speaker}: {seg['text']}")
After step 3, each entry in result["segments"] carries a words list with per-word start, end, word, and speaker fields.
Features
| Feature | whisperX |
|---|---|
| Speaker diarization | Yes |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- m-bain/whisperX: main repo · README with CLI flags, Python usage, and the canonical benchmark numbers · 21.8k stars
- arXiv:2303.00747: the WhisperX paper · Bain, Huh, Han, Zisserman · INTERSPEECH 2023 · cite this in research
- WhisperX · Discussions: GitHub Discussions · install troubleshooting, language-specific alignment models, GPU memory tuning
- PyPI · whisperx: latest stable release · pip install whisperx · Python >=3.9
- SYSTRAN/faster-whisper: the CTranslate2-backed Whisper engine WhisperX uses for stage 1 transcription
- pyannote/pyannote-audio: the speaker diarization library WhisperX uses for stage 3 · CC-BY licensed pretrained models
- pyannote/speaker-diarization-3.1: gated diarization pipeline · accept terms once, then your HF_TOKEN works for --diarize
- openai/whisper: the reference Whisper implementation · upstream of every CTranslate2/wav2vec2 derivative
- wav2vec2 model docs: the forced-alignment backbone WhisperX uses in stage 2 · multilingual checkpoints available
whisperX vs Whipscribe
| Feature | whisperX | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | Yes | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | Linux, macOS, GPU | Web, API, MCP |
Frequently asked about whisperX
Does whisperX do speaker diarization?
Yes — that's the headline feature. whisperX integrates pyannote for 'Speaker 1 / Speaker 2' labeling on top of faster-whisper transcription, producing per-word speaker-attributed output.
Why does whisperX need a HuggingFace token?
pyannote's diarization models are gated on HuggingFace — you accept the terms of use once and get a token. whisperX uses that token at download time. No cost, just an acceptance step.
How accurate are whisperX timestamps?
More accurate than vanilla Whisper at the word level. whisperX runs forced alignment (wav2vec2) over the transcript so word boundaries match the audio — good for subtitles, short-form clips, and speaker-attributed transcripts.
What license is whisperX?
BSD-2-Clause. Note that the pyannote models it downloads have their own terms; read those before commercial deployment.
Does whisperX support streaming?
No. whisperX is batch-only. For streaming ASR, look at Deepgram, Vosk, or whisper.cpp's stream example.
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.