insanely-fast-whisper

Name: insanely-fast-whisper
Author: Vaibhav Srivastav


Opinionated Python CLI that wraps Whisper-large-v3 + Flash Attention 2 + batched chunking for one job — make an H100 or A100 chew through an hour of audio in under a minute.

TL;DR

An HF-team CLI by Vaibhav Srivastav glueing together Hugging Face Transformers, Flash Attention 2, and chunked batched decode over openai/whisper-large-v3. With --flash True --batch-size 24 on an A100-80GB it transcribes 150 minutes of audio in ~98 seconds; with distil-large-v2 it drops to ~78 seconds.
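As a quick sanity check, the sketch below converts the two quoted wall-clock times into seconds of compute per hour of audio and a real-time factor. It uses only the figures quoted above; the helper name `throughput` is illustrative and not part of the CLI.

```python
def throughput(audio_minutes: float, wall_seconds: float) -> tuple:
    """Return (seconds of compute per hour of audio, real-time factor)."""
    seconds_per_audio_hour = wall_seconds / audio_minutes * 60
    real_time_factor = audio_minutes * 60 / wall_seconds
    return seconds_per_audio_hour, real_time_factor


# Figures quoted above: 150 minutes of audio on an A100-80GB.
for label, wall in [("whisper-large-v3", 98), ("distil-large-v2", 78)]:
    per_hour, rtf = throughput(150, wall)
    print(f"{label}: {per_hour:.1f} s per audio hour, ~{rtf:.0f}x real time")
```

That works out to about 39.2 s per audio hour (roughly 92x real time) for whisper-large-v3 and 31.2 s (roughly 115x) for distil-large-v2, both consistent with the "hour of audio in under a minute" claim.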

Best for single-file batch transcription on rented A100 / H100 / RTX 4090 boxes — when you already have a GPU and want the lowest possible wall-clock per hour of audio. Apple Silicon works via --device-id mps but is dramatically slower. License: Apache-2.0. Latest PyPI: 0.0.15 (Python ≥3.8).

What it is

An opinionated CLI wrapper around Hugging Face Transformers + Flash Attention + BetterTransformer. Trades install complexity for throughput: ~150 min of audio in ~98s on an A100. The reference for "how fast can Whisper go on current hardware." Apache-2.0.

Best for: Batch-processing huge backlogs on rented H100/A100 time.
Watch out for: Requires an NVIDIA GPU with enough VRAM for Whisper-large-v3 + Flash Attention; CPU path is not practical.

Install / use

pipx install insanely-fast-whisper

Deployment targets · 5 runtime cards

insanely-fast-whisper is one CLI, but the operational profile changes per box. Each card links the canonical README section for that runtime — VRAM expectations, the flags that matter, and what to skip when the hardware can't take Flash Attention 2.

NVIDIA A100 / H100

Datacenter GPUs · the headline numbers

The reference deployment. Run with --flash True --batch-size 24 on A100-80GB to hit 150 min of audio in ~98 seconds at fp16. An H100 with FlashAttention-3 builds can go further, but the CLI defaults are tuned for A100, and that is what the README benchmarks.

Default model openai/whisper-large-v3 · ~6 GB VRAM at fp16 + activations · batch-size 24 default · pip install flash-attn separately before passing --flash True

Consumer RTX (4090 / 3090)

24 GB Ada / Ampere · lower batch size

Flash Attention 2 works on Ada (RTX 4090) and Ampere (RTX 3090) cards. Drop the batch size — 8 to 16 is realistic on 24 GB once the model and activations are resident. distil-large-v2 buys you headroom at near-parity English WER.
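One way to find a batch size that fits on a 24 GB card is to start high and step down whenever the GPU runs out of memory. The helper below is a hypothetical sketch, not something the CLI provides: `transcribe` stands for any callable you supply that runs one transcription at a given batch size, and the out-of-memory check assumes the error surfaces as a `RuntimeError` whose message contains "out of memory", as PyTorch's CUDA OOM errors do.

```python
def first_batch_size_that_fits(transcribe, candidates=(24, 16, 8, 4)):
    """Try batch sizes from largest to smallest; return (batch_size, result).

    `transcribe` is a caller-supplied function (hypothetical here) that takes
    a batch size and raises RuntimeError on GPU out-of-memory.
    """
    for batch_size in candidates:
        try:
            return batch_size, transcribe(batch_size)
        except RuntimeError as err:
            # Only swallow out-of-memory failures; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
    raise RuntimeError("no candidate batch size fit in GPU memory")


# Stand-in for a real run: pretend anything above 8 exhausts a 24 GB card.
def fake_transcribe(batch_size):
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"transcribed with batch_size={batch_size}"


print(first_batch_size_that_fits(fake_transcribe))
```

With a real pipeline, clearing cached GPU memory between attempts (for example with `torch.cuda.empty_cache()`) is usually needed before retrying.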

Apple Silicon (MPS)

M-series Macs · CLI works, slow

Pass --device-id mps to route through PyTorch MPS instead of CUDA. No Flash Attention path on Apple Silicon — drop --flash. Usable for one-off transcripts, not for batch backlogs. For local Mac speed prefer whisper.cpp or WhisperKit.

insanely-fast-whisper --file-name input.mp3 --device-id mps · ignore --flash · expect minutes-not-seconds per hour of audio

CI containers / one-shot run

pipx run · no install, no venv

pipx run insanely-fast-whisper --file-name <file> downloads the package, transcribes, and exits, which is useful inside CI containers or on a borrowed GPU box where you do not want a persistent install. Pin a version for reproducibility.

pipx run insanely-fast-whisper==0.0.15 --file-name audio.mp3 --flash True --batch-size 24 · transcript written to ./output.json by default

Python library (no CLI)

Use the underlying transformers pipeline directly

If you want programmatic control — your own batching, your own output schema, integration with diarization or alignment — skip the CLI and call the transformers ASR pipeline that insanely-fast-whisper wraps. Same speed, same model, same Flash Attention 2 path.

pipeline('automatic-speech-recognition', model='openai/whisper-large-v3', torch_dtype=torch.float16, device='cuda:0', model_kwargs={'attn_implementation': 'flash_attention_2'})

Picking flags: the README's CLI table is the source of truth — --flash True only works after pip install flash-attn --no-build-isolation succeeds, and that itself only works on Ampere / Ada / Hopper. For everything else use faster-whisper (CTranslate2, no flash-attn build), whisper.cpp (CPU / Apple Silicon), or distil-whisper directly.
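The flag choices described in the cards above can be sketched as a small command builder. This is an illustrative helper rather than part of the tool: the function name and its parameters are assumptions, and it only emits flags that appear on this page (--file-name, --device-id, --flash, --batch-size).

```python
from typing import List, Optional


def build_command(
    audio_path: str,
    device: str,
    flash_attn_installed: bool = False,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Build an insanely-fast-whisper argument list for 'cuda' or 'mps'."""
    if device not in ("cuda", "mps"):
        raise ValueError("use faster-whisper or whisper.cpp on other hardware")

    cmd = ["insanely-fast-whisper", "--file-name", audio_path]
    if device == "mps":
        # Apple Silicon: route through PyTorch MPS and never pass --flash.
        cmd += ["--device-id", "mps"]
    elif flash_attn_installed:
        # CUDA with a working flash-attn build (Ampere / Ada / Hopper).
        cmd += ["--flash", "True"]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


print(" ".join(build_command("podcast.mp3", "cuda", True, 24)))
print(" ".join(build_command("input.mp3", "mps")))
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)`.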

Setup recipes · pick one and copy

Three configurations covering the most common insanely-fast-whisper deployments. Verified against PyPI release 0.0.15 (2024-05-27) and the current README.

1. Install + first transcript

pipx install · single audio file · no extras. The default path when you just want to see it work.

# insanely-fast-whisper 0.0.15 · Python >=3.8
pipx install insanely-fast-whisper

# Transcribe a local file or a URL.
# Writes output.json (chunked timestamps + text) in cwd.
insanely-fast-whisper --file-name input.mp3

# Pin the version inside CI / scripts:
# pipx install insanely-fast-whisper==0.0.15 --force

# On Python 3.11.x, if pipx complains about requires-python, use:
# pipx install insanely-fast-whisper --force \
#   --pip-args="--ignore-requires-python"

Source: Vaibhavs10/insanely-fast-whisper README. PyPI: insanely-fast-whisper 0.0.15.
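Once output.json exists, a few lines of Python turn it into a readable transcript. The sketch below assumes the file has a top-level "chunks" list whose entries carry a "timestamp" pair and a "text" string, matching the "chunked timestamps + text" description above; check the key names against your own output before relying on it. The function names are illustrative.

```python
import json


def format_chunks(transcript: dict) -> list:
    """Render each chunk as '[start -> end] text'."""
    lines = []
    for chunk in transcript.get("chunks", []):
        start, end = chunk["timestamp"]
        # The final chunk's end time can be missing (None) in some outputs.
        end_label = f"{end:7.2f}" if end is not None else "    end"
        lines.append(f"[{start:7.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


def load_and_format(path: str = "output.json") -> list:
    """Read a transcript JSON file from disk and format its chunks."""
    with open(path, encoding="utf-8") as handle:
        return format_chunks(json.load(handle))


# In-memory sample with the assumed shape, so the sketch runs without a file.
sample = {
    "text": " Hello there. Goodbye.",
    "chunks": [
        {"timestamp": [0.0, 4.5], "text": " Hello there."},
        {"timestamp": [4.5, None], "text": " Goodbye."},
    ],
}
print("\n".join(format_chunks(sample)))
```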

2. Flash Attention 2 + batched decode (A100 / H100)

Install flash-attn separately, then pass --flash True --batch-size 24. This is the configuration the headline ~98s/150min number is measured against.

# Step 1 — build flash-attn against your CUDA stack.
# Ampere (A100), Ada (RTX 4090), or Hopper (H100) only.
pip install flash-attn --no-build-isolation

# Step 2 — install the CLI itself.
pipx install insanely-fast-whisper

# Step 3 — run with Flash Attention 2 + batched chunks.
insanely-fast-whisper \
  --file-name podcast.mp3 \
  --model-name openai/whisper-large-v3 \
  --batch-size 24 \
  --flash True \
  --timestamp chunk \
  --transcript-path output.json

# distil-whisper variant — ~20% faster again, English-focused:
# --model-name distil-whisper/distil-large-v2

Flash Attention 2 install: Dao-AILab/flash-attention · benchmark table: README benchmarks.
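Before spending time on the flash-attn build, it can help to check that the GPU is one of the supported generations. The sketch below maps a CUDA compute capability to the Ampere / Ada / Hopper restriction stated above (Ampere is 8.0 and 8.6, Ada is 8.9, Hopper is 9.0). The function name is illustrative, and treating only major versions 8 and 9 as supported is an assumption taken from that restriction, not a statement about newer architectures.

```python
def supports_flash_attn_2(capability: tuple) -> bool:
    """Return True for Ampere/Ada (8.x) and Hopper (9.x) compute capabilities."""
    major, _minor = capability
    return major in (8, 9)


# Example capabilities: A100 (8, 0), RTX 4090 (8, 9), H100 (9, 0), T4 (7, 5).
for name, cap in [("A100", (8, 0)), ("RTX 4090", (8, 9)),
                  ("H100", (9, 0)), ("T4", (7, 5))]:
    print(name, supports_flash_attn_2(cap))
```

On a machine with PyTorch and CUDA available, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple to pass in.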

3. Python API + diarization with pyannote

Drop the CLI and call the underlying transformers pipeline directly, then add speaker labels via pyannote.audio. Requires a Hugging Face access token.

# pip install transformers torch pyannote.audio
import torch
from transformers import pipeline
from pyannote.audio import Pipeline as Diarizer

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

out = asr(
    "meeting.wav",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)

# pyannote needs a HF token + accepted gated model conditions.
diar = Diarizer.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<HF_TOKEN>",
).to(torch.device("cuda"))

speakers = diar("meeting.wav")
# Merge ASR chunks with speaker turns by overlap on the timeline.
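
The merge step mentioned in the last comment can be sketched in plain Python. This is one simple approach (assign each ASR chunk to the speaker turn it overlaps most), not necessarily how the CLI does it. It assumes ASR chunks shaped like the pipeline's `out["chunks"]` entries, with a "timestamp" pair and "text", and speaker turns flattened to `(start, end, label)` tuples; the function names are illustrative.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, turns):
    """Label each ASR chunk with the speaker whose turn overlaps it most."""
    labeled = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            # A missing end time is treated as "runs to the end of the audio".
            end = float("inf")
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(start, end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append({**chunk, "speaker": best_speaker})
    return labeled


# Toy data in the assumed shapes.
chunks = [
    {"timestamp": (0.0, 4.0), "text": " Morning, everyone."},
    {"timestamp": (4.0, None), "text": " Shall we start?"},
]
turns = [(0.0, 3.5, "SPEAKER_00"), (3.5, 9.0, "SPEAKER_01")]

for item in assign_speakers(chunks, turns):
    print(item["speaker"], item["text"].strip())
```

With pyannote, the turns list can be built from the diarization result with `[(seg.start, seg.end, label) for seg, _, label in speakers.itertracks(yield_label=True)]`. Chunks that overlap no turn keep a speaker of `None`.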

Pipeline API ref: transformers · Whisper docs. CLI also ships a --hf-token + --diarization_model path if you want diarization in one shot.

Features

Speaker diarization	Yes
Word-level timestamps	Yes
Streaming / real-time	No
Languages supported	99
HIPAA eligible	No