
whisperX

by Max Bain

Whisper plus word-aligned timestamps plus speaker labels in a single Python CLI. Max Bain's pipeline — faster-whisper for ASR, wav2vec2 forced alignment, pyannote diarization. Pick a recipe.

TL;DR

WhisperX is a three-stage pipeline glued into one CLI and Python API. Stage 1 transcribes with faster-whisper (CTranslate2-backed Whisper, batched decode). Stage 2 runs wav2vec2 forced alignment so every word lands on a real audio boundary, not Whisper's drifted segment-level timestamps. Stage 3 calls pyannote.audio 3.1 for speaker diarization and assigns each aligned word to a speaker label.

Best for any workflow where who said what, and when, matters: podcast post-production, interview transcripts, meeting recordings, captioning workflows that need karaoke-style word highlighting, and research datasets with speaker turns. License: BSD-2-Clause on WhisperX itself; the pyannote models it downloads are gated on Hugging Face and have their own terms, so read those before commercial deployment. Latest: v3.8.5, paper at arXiv:2303.00747 (INTERSPEECH 2023).

Category: Open source
License: BSD-2-Clause
Stars: ★ 21.4k
Last push: 2026-04-04
Pricing: free
Platforms: Linux, macOS, GPU

What it is

whisperX combines faster-whisper with forced alignment (wav2vec2) for word-accurate timestamps and pyannote for speaker diarization. If your audio has more than one speaker and you care about proper "Speaker 1 / Speaker 2" labeling, this is the open-source default. BSD-2 licensed.

Best for: Multi-speaker content (podcasts, interviews, meetings) where "who said what" matters.
Watch out for: Requires a HuggingFace token to download pyannote diarization models (gated); heavier first-run setup.

Install / use

pip install whisperx

Use cases · pick the workflow that matches

WhisperX earns its place when alignment and diarization both matter. These are the workflows the README, Discussions, and the paper authors call out as primary fits — each card links the relevant section so you can jump straight to the right recipe.

Podcast post-production
Speaker turns + word timestamps for show notes and chapters

Two-host or guest interviews where you need accurate speaker labels for show notes, transcript pages, and chapter markers. WhisperX gives you per-word speaker attribution out of the box — drop it into a post-prod script and you get a publishable transcript without manual labeling.


Pair --diarize with --min_speakers and --max_speakers when you already know how many voices to expect.
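
As a rough sketch of the post-production step, the snippet below merges consecutive segments from the same speaker into turns for a transcript page. It assumes segments shaped like diarized WhisperX output (start, end, text, speaker); the helper name merge_speaker_turns and the sample data are hypothetical and not part of WhisperX.

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments that share a speaker into one turn."""
    turns = []
    for seg in segments:
        # Fall back to a placeholder when a segment has no speaker label.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


# Hypothetical sample data, for illustration only.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome back.", "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 5.0, "text": " Today we have a guest.", "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 7.0, "text": " Thanks for having me.", "speaker": "SPEAKER_01"},
]

turns = merge_speaker_turns(sample_segments)
for t in turns:
    print(f"[{t['start']:.1f}-{t['end']:.1f}] {t['speaker']}: {t['text']}")
```

Mapping SPEAKER_00 style labels to host and guest names is still a manual step, since diarization only clusters voices.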
Interview and meeting recordings
Single-channel mic captures with overlapping speech

Field interviews, sales calls, customer-research sessions, internal meetings — typically recorded on one channel with no per-speaker isolation. WhisperX runs VAD-based segmentation before transcribe, then pyannote-3.1 for clustering, which holds up reasonably on 2–6 speakers without per-mic separation.


VAD cuts long silences; batched faster-whisper decodes the chunks; pyannote labels them. Set HF_TOKEN once for diarization.
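
To make the VAD-then-batch idea concrete, here is a minimal sketch that greedily packs detected speech intervals into chunks of at most 30 seconds. It illustrates the general technique only and is not WhisperX's actual implementation; the function name merge_vad_intervals, the 30-second limit as a default, and the sample intervals are assumptions.

```python
def merge_vad_intervals(intervals, max_chunk_s=30.0):
    """Greedily pack sorted (start, end) speech intervals into chunks.

    A chunk is closed as soon as adding the next interval would make it
    longer than max_chunk_s. A single interval longer than the limit is
    kept as its own chunk rather than split.
    """
    chunks = []
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            cur_end = end  # still fits: extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


# Hypothetical speech intervals in seconds, as a VAD might report them.
speech = [(0.0, 8.0), (9.5, 20.0), (21.0, 33.0), (40.0, 52.0), (53.0, 61.0)]
chunks = merge_vad_intervals(speech)
print(chunks)
```

Packing speech into fewer, fuller chunks is what lets a batched decoder process several of them at once while skipping the silence in between.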
Subtitles with word highlighting
SRT/VTT with --highlight_words for karaoke captions

Word-by-word karaoke captioning needs every word boundary to match the audio. Whisper's native segment-level timestamps are too coarse for that and can drift by hundreds of milliseconds. The wav2vec2 alignment pass fixes this, and --highlight_words emits an SRT in which the word currently being spoken is underlined.


Output formats: srt, vtt, txt, tsv, json, aud. Use --highlight_words True for karaoke SRTs.
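
The following sketch shows roughly how word-level timestamps turn into karaoke-style cues: one SRT cue per word, with the active word marked up. It assumes the highlight is rendered with <u> tags and that words arrive as dicts with word, start, and end; the helper names srt_time and karaoke_srt and the sample words are hypothetical, and the real CLI output may differ in detail.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def karaoke_srt(words):
    """Build one SRT cue per word, underlining the word being spoken."""
    cues = []
    for i, w in enumerate(words):
        line = " ".join(
            f"<u>{x['word']}</u>" if j == i else x["word"]
            for j, x in enumerate(words)
        )
        cues.append(
            f"{i + 1}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{line}\n"
        )
    return "\n".join(cues)


# Hypothetical aligned words, for illustration only.
words = [
    {"word": "Hello", "start": 0.50, "end": 0.82},
    {"word": "there", "start": 0.90, "end": 1.25},
]
srt_text = karaoke_srt(words)
print(srt_text)
```

This is why alignment quality matters here: each cue lasts only as long as one word, so a boundary that is off by a few hundred milliseconds highlights the wrong word.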
Video editor handoff
Premiere / DaVinci / Final Cut compatible exports

Word-accurate timestamps are the input format text-based video editors expect. Export WhisperX's per-word JSON, then feed it into Premiere's caption track or DaVinci Resolve's subtitle import. Speaker labels survive the round-trip so editors can color-code talking heads on the timeline.


JSON output preserves word-level start/end + speaker fields. tsv is the easiest to inspect by hand.
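
For a handoff script, a minimal sketch like the one below flattens per-word JSON into tab-separated rows that are easy to inspect or convert further. The JSON excerpt is hypothetical and only shaped like diarized WhisperX output; the function name words_to_tsv is an assumption, and words without timings are skipped on the assumption that the aligner could not place them.

```python
import csv
import io
import json

# Hypothetical excerpt shaped like WhisperX JSON output after diarization.
raw = json.dumps({"segments": [{
    "start": 0.5, "end": 1.25, "text": " Hello there 2024",
    "speaker": "SPEAKER_00",
    "words": [
        {"word": "Hello", "start": 0.5, "end": 0.82, "speaker": "SPEAKER_00"},
        {"word": "there", "start": 0.9, "end": 1.25, "speaker": "SPEAKER_00"},
        {"word": "2024"},  # assumed case: a token returned without timings
    ],
}]})


def words_to_tsv(json_text):
    """Flatten per-word entries into start/end/speaker/word TSV rows."""
    data = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["start", "end", "speaker", "word"])
    for seg in data["segments"]:
        for w in seg.get("words", []):
            if "start" not in w or "end" not in w:
                continue  # no timing available for this word
            speaker = w.get("speaker", seg.get("speaker", ""))
            writer.writerow([w["start"], w["end"], speaker, w["word"]])
    return buf.getvalue()


tsv_text = words_to_tsv(raw)
print(tsv_text)
```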
Research and dataset prep
ASR training data with speaker turns + forced alignment

Building speech datasets, training a downstream model, or running linguistics analysis on conversational corpora. WhisperX is the closest open-source pipeline to a one-shot dataset annotator — Whisper transcript, wav2vec2 alignment for word boundaries, pyannote turns. Cite arXiv:2303.00747.


Per-language --align_model override: WAV2VEC2_ASR_LARGE_LV60K_960H for English, language-specific checkpoints for non-English.
Batch transcription on a GPU box
70× realtime on large-v2 · <8 GB VRAM at beam 5

Throughput mode on a single GPU. The README's own benchmarks show ~70× realtime on large-v2 with batched faster-whisper and under 8 GB of VRAM at beam_size=5. Pass --compute_type int8 to cut VRAM use by roughly half at a small WER cost, which is useful on 6–8 GB consumer cards.


Tune --batch_size to fit VRAM (4 on 6 GB, 8 on 8 GB, 16+ on 16 GB). --compute_type int8 for laptops.
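
The sizing guidance above can be sketched as a small helper plus a back-of-the-envelope runtime estimate. The thresholds mirror the batch sizes suggested in this section; the function names are hypothetical, and the 70× figure is the README benchmark quoted here, so treat the estimate as an upper bound on speed rather than a promise for your hardware.

```python
def pick_batch_size(vram_gb):
    """Starting batch size for a given amount of VRAM (in GB)."""
    if vram_gb >= 16:
        return 16
    if vram_gb >= 8:
        return 8
    return 4


def estimated_wall_seconds(audio_seconds, realtime_factor=70.0):
    """Wall-clock estimate: audio duration divided by the realtime factor."""
    return audio_seconds / realtime_factor


batch_for_8gb = pick_batch_size(8)
one_hour_estimate = round(estimated_wall_seconds(3600), 1)
print(batch_for_8gb)       # 8
print(one_hour_estimate)   # 51.4
```

At 70× realtime, an hour of audio works out to about 51 seconds of transcription time, before alignment and diarization are added on top.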
The stack underneath: WhisperX is glue. Read faster-whisper for the transcription engine and openai/whisper for the reference model. Streaming is out of scope — try Vosk or Deepgram for that.

Setup recipes · pick one and copy

Three runnable configurations covering the most common WhisperX deployments. All blocks target whisperx 3.8.x with CUDA 12. Replace ${HF_TOKEN} with your own Hugging Face access token from hf.co/settings/tokens.

1. Install + first transcript (CLI)

pip install · single command CLI · SRT subtitles in one shot. The fastest way to verify the install works.

# whisperx 3.8.x · Python >=3.9 · CUDA 12 for GPU
pip install whisperx

# transcribe a file, write subtitles
whisperx input.mp3 \
  --model large-v3 \
  --compute_type float16 \
  --batch_size 16 \
  --output_format srt

# CPU-only laptop fallback
whisperx input.mp3 \
  --model medium \
  --device cpu \
  --compute_type int8 \
  --batch_size 4
2. With speaker diarization (CLI)

Accept the pyannote model terms on Hugging Face first, then pass --diarize with your HF_TOKEN. Outputs per-word speaker labels.

# 1. Sign in at https://hf.co and accept the user conditions on:
#      https://huggingface.co/pyannote/segmentation-3.0
#      https://huggingface.co/pyannote/speaker-diarization-3.1
# 2. Create a read token at https://hf.co/settings/tokens
# 3. Export it (never hardcode):
export HF_TOKEN="hf_...your_token_here..."

# 4. Transcribe + align + diarize in one pass
whisperx input.mp3 \
  --model large-v3 \
  --diarize \
  --hf_token "${HF_TOKEN}" \
  --min_speakers 2 \
  --max_speakers 4 \
  --output_format srt

# karaoke-style word highlighting in the SRT
whisperx input.mp3 --model large-v3 --diarize \
  --hf_token "${HF_TOKEN}" --highlight_words True --output_format srt
Diarization runs on pyannote/speaker-diarization-3.1 (gated; check the model card for the current license). Terms of use must be accepted on the model card before the token works.
3. Python API · transcribe + align + diarize

Three-stage call: load_model → load_align_model + align → DiarizationPipeline + assign_word_speakers. The pattern from the README.

import os
import whisperx
from whisperx.diarize import DiarizationPipeline

device = "cuda"
batch_size = 16
compute_type = "float16"   # "int8" on low-VRAM GPUs
audio_file = "input.mp3"
HF_TOKEN = os.environ["HF_TOKEN"]   # never hardcode

# 1. Transcribe (faster-whisper under the hood)
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Forced alignment (wav2vec2)
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device,
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 3. Diarize + assign speakers to each word
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    speaker = seg.get("speaker", "?")
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {speaker}: {seg['text']}")
Source: m-bain/whisperX · Python usage. After step 3, each entry in result["segments"] has a speaker label and a words list whose items carry word, start, end, and speaker fields.
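
Once speakers are assigned, simple post-processing becomes possible. As a minimal sketch, the snippet below totals speaking time per speaker from a list of segments. It assumes the segment shape described above; the helper name speaking_time, the UNKNOWN placeholder, and the sample data are hypothetical.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum segment durations per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        # Segments without a label are grouped under a placeholder.
        totals[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
    return dict(totals)


# Hypothetical diarized segments, for illustration only.
diarized = [
    {"start": 0.0, "end": 2.5, "text": " Welcome back.", "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 5.0, "text": " Let's begin.", "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 7.0, "text": " Sure.", "speaker": "SPEAKER_01"},
    {"start": 7.0, "end": 8.0, "text": " Mm-hm."},
]

totals = speaking_time(diarized)
print(totals)
```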

Features

Speaker diarization: Yes
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 99
HIPAA eligible: No


whisperX vs Whipscribe

Feature             | whisperX          | Whipscribe
Category            | Open source       | Transcription APIs
Pricing             | free              | free beta
Speaker diarization | Yes               | Yes
Word timestamps     | Yes               | Yes
Streaming           | No                | No
Languages           | 99                | 99
Platforms           | Linux, macOS, GPU | Web, API, MCP


Frequently asked about whisperX

Does whisperX do speaker diarization?

Yes — that's the headline feature. whisperX integrates pyannote for 'Speaker 1 / Speaker 2' labeling on top of faster-whisper transcription, producing per-word speaker-attributed output.

Why does whisperX need a HuggingFace token?

pyannote's diarization models are gated on HuggingFace — you accept the terms of use once and get a token. whisperX uses that token at download time. No cost, just an acceptance step.
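
Because a missing token only surfaces when the gated model download fails, a script can check for it up front. This is a minimal sketch that assumes the token lives in the HF_TOKEN environment variable, as in the recipes on this page; the function name resolve_hf_token and the error wording are hypothetical.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment or fail early."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Accept the pyannote model terms on "
            "Hugging Face, create a read token, and export it first."
        )
    return token


# Demonstration with an explicit mapping instead of the real environment.
demo_token = resolve_hf_token({"HF_TOKEN": " hf_example_value "})
print(demo_token)
```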

How accurate are whisperX timestamps?

More accurate than vanilla Whisper at the word level. whisperX runs forced alignment (wav2vec2) over the transcript so word boundaries match the audio — good for subtitles, short-form clips, and speaker-attributed transcripts.

What license is whisperX?

BSD-2-Clause. Note that the pyannote models it downloads have their own terms; read those before commercial deployment.

Does whisperX support streaming?

No. whisperX is batch-only. For streaming ASR, look at Deepgram, Vosk, or whisper.cpp's stream example.

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.