insanely-fast-whisper
Opinionated Python CLI that wraps Whisper-large-v3 + Flash Attention 2 + batched chunking for one job — make an H100 or A100 chew through an hour of audio in under a minute.
An HF-team CLI by Vaibhav Srivastav gluing together Hugging Face Transformers, Flash Attention 2, and chunked batched decode over openai/whisper-large-v3. With --flash True --batch-size 24 on an A100-80GB it transcribes 150 minutes of audio in ~98 seconds; with distil-large-v2 it drops to ~78 seconds.
Best for single-file batch transcription on rented A100 / H100 / RTX 4090 boxes — when you already have a GPU and want the lowest possible wall-clock per hour of audio. Apple Silicon works via --device-id mps but is dramatically slower. License: Apache-2.0. Latest PyPI: 0.0.15 (Python ≥3.8).
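To make "chunked batched decode" concrete, here is a minimal sketch of how long audio can be cut into overlapping fixed-length windows and grouped into batches. The helper names `chunk_windows` and `batches`, the 30-second window, and the 5-second overlap are illustrative assumptions, not values read from the CLI or from transformers.

```python
def chunk_windows(total_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) windows that cover [0, total_s] with overlap.

    chunk_s and overlap_s are assumed values for illustration only.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


def batches(items, batch_size):
    """Group windows into lists of at most batch_size for one forward pass each."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


windows = chunk_windows(150 * 60)      # 150 minutes of audio
groups = batches(windows, 24)          # batch size 24, as in the A100 setup
print(len(windows), "windows in", len(groups), "batches")
```

Under these assumed parameters, 150 minutes of audio becomes 360 windows, which is 15 batches at batch size 24. The point of the design is that the GPU decodes many windows per step instead of walking the file sequentially.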
What it is
An opinionated CLI wrapper around Hugging Face Transformers + Flash Attention + BetterTransformer. Trades install complexity for throughput: ~150 min of audio in ~98s on an A100. The reference for "how fast can Whisper go on current hardware." Apache-2.0.
Watch out for: The headline numbers require an NVIDIA GPU with enough VRAM for Whisper-large-v3 + Flash Attention. Apple Silicon runs via --device-id mps but is dramatically slower, and the CPU path is not practical.
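The benchmark figures quoted above are easier to compare once converted to a real-time factor and to seconds of compute per hour of audio. The sketch below only does that arithmetic; the function names are hypothetical, and the inputs are the 150 min / ~98 s (large-v3) and 150 min / ~78 s (distil-large-v2) numbers from this page.

```python
def realtime_factor(audio_minutes, wall_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_minutes * 60.0 / wall_seconds


def seconds_per_audio_hour(audio_minutes, wall_seconds):
    """Wall-clock seconds needed to transcribe one hour of audio."""
    return wall_seconds * 60.0 / audio_minutes


for name, wall in [("whisper-large-v3", 98), ("distil-large-v2", 78)]:
    rtf = realtime_factor(150, wall)
    per_hour = seconds_per_audio_hour(150, wall)
    print(f"{name}: ~{rtf:.0f}x real time, ~{per_hour:.1f} s per hour of audio")
```

At these figures, large-v3 works out to roughly 92x real time (about 39 s per hour of audio) and distil-large-v2 to roughly 115x (about 31 s per hour), which is consistent with the "hour of audio in under a minute" claim.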
Install / use
pipx install insanely-fast-whisper
Deployment targets · 5 runtime cards
insanely-fast-whisper is one CLI, but the operational profile changes per box. Each card links the canonical README section for that runtime — VRAM expectations, the flags that matter, and what to skip when the hardware can't take Flash Attention 2.
The reference deployment. Run with --flash True --batch-size 24 on an A100-80GB to transcribe 150 min of audio in ~98 seconds at fp16. An H100 with a FlashAttention-3 build can go further, but the CLI defaults are tuned for A100, and that is what the README benchmarks.
Default model openai/whisper-large-v3 · ~6 GB VRAM at fp16 + activations · batch-size 24 default · pip install flash-attn separately before passing --flash True
Flash Attention 2 works on Ada (RTX 4090) and Ampere (RTX 3090) cards. Drop the batch size — 8 to 16 is realistic on 24 GB once the model and activations are resident. distil-large-v2 buys you headroom at near-parity English WER.
Suggested: --model-name distil-whisper/distil-large-v2 --batch-size 8 --flash True · expect ~2-3x slower than A100-80GB but still an order of magnitude faster than the reference openai-whisper
Pass --device-id mps to route through PyTorch MPS instead of CUDA. No Flash Attention path on Apple Silicon — drop --flash. Usable for one-off transcripts, not for batch backlogs. For local Mac speed prefer whisper.cpp or WhisperKit.
insanely-fast-whisper --file-name input.mp3 --device-id mps · ignore --flash · expect minutes-not-seconds per hour of audio
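If you script the CLI across different boxes, the per-runtime advice above can be captured in one place. This is a minimal sketch, assuming a hypothetical helper `build_command`; it uses only flags that appear on this page (--file-name, --device-id, --batch-size, --model-name, --flash) and drops --flash whenever the device is mps.

```python
import shlex


def build_command(file_name, device_id="0", batch_size=24,
                  use_flash=True, model_name=None):
    """Build an insanely-fast-whisper argv list for a given device.

    Hypothetical helper: it never passes --flash on Apple Silicon (mps),
    because there is no Flash Attention path there.
    """
    cmd = [
        "insanely-fast-whisper",
        "--file-name", file_name,
        "--device-id", device_id,
        "--batch-size", str(batch_size),
    ]
    if model_name:
        cmd += ["--model-name", model_name]
    if use_flash and device_id != "mps":
        cmd += ["--flash", "True"]
    return cmd


# A100-style run versus an Apple Silicon run (batch size 4 is an arbitrary example).
print(shlex.join(build_command("input.mp3")))
print(shlex.join(build_command("input.mp3", device_id="mps", batch_size=4)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where the CLI is installed.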
pipx run insanely-fast-whisper --file-name
pipx run insanely-fast-whisper==0.0.15 --file-name audio.mp3 --flash True --batch-size 24 · transcript written to ./output.json by default
If you want programmatic control — your own batching, your own output schema, integration with diarization or alignment — skip the CLI and call the transformers ASR pipeline that insanely-fast-whisper wraps. Same speed, same model, same Flash Attention 2 path.
pipeline('automatic-speech-recognition', model='openai/whisper-large-v3', torch_dtype=torch.float16, device='cuda:0', model_kwargs={'attn_implementation': 'flash_attention_2'})
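The call above hard-codes `flash_attention_2`, which fails if the flash-attn package is not installed. One hedged way to keep a script portable is to check for the package first and fall back to PyTorch's scaled-dot-product attention ("sdpa"). The helper name `pick_attn_implementation` is hypothetical, and the check only confirms the package is importable, not that the GPU architecture supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Assumption: importability is used as a proxy for a working flash-attn
    build; it does not verify GPU compatibility.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Pass this dict as model_kwargs=... to transformers.pipeline().
model_kwargs = {"attn_implementation": pick_attn_implementation()}
print(model_kwargs)
```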
--flash True only works after pip install flash-attn --no-build-isolation succeeds, and that itself only works on Ampere / Ada / Hopper. For everything else use faster-whisper (CTranslate2, no flash-attn build), whisper.cpp (CPU / Apple Silicon), or distil-whisper directly.
Setup recipes · pick one and copy
Three configurations covering the most common insanely-fast-whisper deployments. Verified against PyPI release 0.0.15 (2024-05-27) and the current README.
pipx install · single audio file · no extras. The default path when you just want to see it work.
# insanely-fast-whisper 0.0.15 · Python >=3.8
pipx install insanely-fast-whisper
# Transcribe a local file or a URL.
# Writes output.json (chunked timestamps + text) in cwd.
insanely-fast-whisper --file-name input.mp3
# Pin the version inside CI / scripts:
# pipx install insanely-fast-whisper==0.0.15 --force
# Python 3.11.x and pipx complains about requires-python? Use:
# pipx install insanely-fast-whisper --force \
# --pip-args="--ignore-requires-python"
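Once output.json exists, a few lines of Python can turn the chunked timestamps into subtitles. This is a sketch under an assumption: that the file holds a top-level "chunks" list whose items look like {"timestamp": [start, end], "text": "..."}. Check your own output.json before relying on that layout; the helper names are hypothetical.

```python
import json


def format_ts(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SRT style)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def chunks_to_srt(result):
    """Render the assumed {"chunks": [...]} layout as SRT text."""
    blocks = []
    for i, chunk in enumerate(result.get("chunks", []), start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk may lack an end time
            end = start
        text = chunk["text"].strip()
        blocks.append(f"{i}\n{format_ts(start)} --> {format_ts(end)}\n{text}\n")
    return "\n".join(blocks)


def load_result(path="output.json"):
    """Read the transcript JSON written by the CLI."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


sample = {"chunks": [{"timestamp": [0.0, 2.5], "text": " Hello."}]}
print(chunks_to_srt(sample))
```

With a real file, `chunks_to_srt(load_result("output.json"))` produces the same format for every chunk.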
Install flash-attn separately, then pass --flash True --batch-size 24. This is the configuration the headline ~98s/150min number is measured against.
# Step 1 — build flash-attn against your CUDA stack.
# Ampere (A100), Ada (RTX 4090), or Hopper (H100) only.
pip install flash-attn --no-build-isolation
# Step 2 — install the CLI itself.
pipx install insanely-fast-whisper
# Step 3 — run with Flash Attention 2 + batched chunks.
insanely-fast-whisper \
  --file-name podcast.mp3 \
  --model-name openai/whisper-large-v3 \
  --batch-size 24 \
  --flash True \
  --timestamp chunk \
  --transcript-path output.json
# distil-whisper variant — ~20% faster again, English-focused:
# --model-name distil-whisper/distil-large-v2
Drop the CLI and call the underlying transformers pipeline directly, then add speaker labels via pyannote.audio. Requires a Hugging Face access token.
# pip install transformers torch pyannote.audio
import torch
from transformers import pipeline
from pyannote.audio import Pipeline as Diarizer
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)
out = asr(
    "meeting.wav",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)
# pyannote needs a HF token + accepted gated model conditions.
diar = Diarizer.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<HF_TOKEN>",
).to(torch.device("cuda"))
speakers = diar("meeting.wav")
# Merge ASR chunks with speaker turns by overlap on the timeline.
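The merge step named in the last comment can be sketched as a largest-overlap assignment: each ASR chunk gets the speaker whose turn overlaps it the most. This is one simple strategy, not necessarily what the CLI does internally. The helper names are hypothetical, and the sketch assumes chunks shaped like {"timestamp": (start, end), "text": ...} and turns given as (start, end, label) tuples.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, turns):
    """Label each ASR chunk with the speaker turn that overlaps it most.

    chunks: [{"timestamp": (start, end), "text": str}, ...]
    turns:  [(start, end, speaker_label), ...]
    With pyannote, turns can be built from the diarization result, e.g.
    [(seg.start, seg.end, label)
     for seg, _, label in speakers.itertracks(yield_label=True)]
    """
    labeled = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk may lack an end time
            end = start
        best_label, best_overlap = None, 0.0
        for t_start, t_end, label in turns:
            ov = overlap(start, end, t_start, t_end)
            if ov > best_overlap:
                best_label, best_overlap = label, ov
        labeled.append(
            {"speaker": best_label, "timestamp": (start, end), "text": chunk["text"]}
        )
    return labeled


chunks = [
    {"timestamp": (0.0, 4.0), "text": "Shall we start?"},
    {"timestamp": (4.0, 9.0), "text": "Yes, go ahead."},
]
turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 10.0, "SPEAKER_01")]
for row in assign_speakers(chunks, turns):
    print(row["speaker"], row["text"])
```

Chunks that overlap no turn keep `speaker=None`, which makes gaps in the diarization visible instead of silently guessing.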
If you want diarization in one shot, stay on the CLI and pass --hf-token plus --diarization_model.
Features
| Feature | insanely-fast-whisper |
|---|---|
| Speaker diarization | Yes |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- Vaibhavs10/insanely-fast-whisper: main repo · README owns the benchmark table, CLI flag reference, and FAQ
- PyPI · insanely-fast-whisper 0.0.15: latest release 2024-05-27 · Python ≥3.8 · pip or pipx
- transformers · Whisper docs: the upstream model the CLI wraps · pipeline() signature + WhisperForConditionalGeneration
- Dao-AILab/flash-attention: the --flash True dependency · install with pip install flash-attn --no-build-isolation · Ampere / Ada / Hopper only
- pyannote/pyannote-audio: speaker diarization · MIT-licensed · gated HF models require an access token and accepted user conditions
- distil-whisper/distil-large-v2: the alternate --model-name in the README's fastest configuration · ~78s for 150min on A100 in benchmarks
- HF blog · chunking long-form audio for ASR: background on the chunked-batched-decode strategy this CLI bakes in
- Community showcase: downstream projects built on top of the CLI · diarization wrappers, web UIs, lambda packagings
insanely-fast-whisper vs Whipscribe
| Feature | insanely-fast-whisper | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | Yes | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | Linux, GPU | Web, API, MCP |
Alternatives to insanely-fast-whisper
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.