whisper.cpp

Name: whisper.cpp
Author: Georgi Gerganov

Pure-C/C++ port of Whisper. No PyTorch, no CUDA dependency, ggml-format quantized weights. Runs on Apple Silicon, NVIDIA, Linux CPU, Raspberry Pi, iOS, Android, and the browser via WASM.

TL;DR

Georgi Gerganov's dependency-free C/C++ reimplementation of Whisper, built on the same ggml tensor library that powers llama.cpp. Weights ship as ggml-format .bin files, either fp16 or quantized (q8_0, q5_1, q5_0, q4_0): a ~3 GB large-v3 shrinks to ~1.1 GB at q5_0 with negligible WER loss. Backends span Metal (default on Apple Silicon), Core ML via Apple Neural Engine, CUDA, Vulkan, OpenBLAS, ARM NEON, AVX/AVX2, and WebAssembly for in-browser inference.

Best for offline / on-device transcription where you do not want a Python runtime in the loop: Mac desktop apps, NVIDIA edge boxes, Pi-class ARM hardware, iOS/Android apps that bundle the model, and privacy-preserving browser demos. License: MIT. Upstream repo moved from ggerganov/whisper.cpp to ggml-org/whisper.cpp; the old URL redirects.

What it is

whisper.cpp is a dependency-free C/C++ port of Whisper. It needs neither PyTorch nor CUDA, runs on everything from phones to servers, and is one of the fastest Whisper options on Apple Silicon thanks to the Metal backend. It is built on the same ggml tensor library as llama.cpp. A good fit when you need privacy-preserving, offline transcription on consumer hardware.

Best for: Offline / on-device transcription, Apple Silicon Metal acceleration, low-RAM targets.
Watch out for: No speaker diarization out of the box (speaker labels need an external tool such as pyannote); model management is manual.

Install / use

git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp && cmake -B build && cmake --build build -j --config Release

Pick a runtime · 6 platforms

whisper.cpp's differentiating story is reach. The model file is the same ggml .bin everywhere — what changes is the backend you build against. Each card links the canonical example folder in the upstream repo so you can read the actual project files before committing to a stack.

Apple Silicon · Metal + Core ML

Default backend on M-series Macs

Metal is enabled by default when you build on Apple Silicon, with no flag needed. As an optional extra step, enable Core ML so the encoder runs on the Apple Neural Engine, which lifts throughput on M1/M2/M3/M4 by roughly 2-3x on small/base/medium models. Build with cmake -B build -DWHISPER_COREML=1.

ggml-large-v3 + Core ML encoder on M2 Pro · realtime factor well under 1x for most podcast audio.

NVIDIA · CUDA / cuBLAS

Server-class GPU inference

Build with cmake -B build -DGGML_CUDA=1 and the encoder + decoder run on cuBLAS kernels. Pair with q5_0 or q8_0 quantized weights to fit large-v3 on consumer 6-8 GB cards. The repo's CUDA Dockerfile is the simplest reproducible path for an Ubuntu box.

Production tier · materially slower than faster-whisper on the same GPU, but no Python in the runtime.

Linux CPU · OpenBLAS / OpenVINO

Headless servers and CI runners

Default build is pthreads + AVX/AVX2; add cmake -B build -DGGML_BLAS=1 for an OpenBLAS speedup, or -DWHISPER_OPENVINO=1 for Intel's OpenVINO encoder path on Xeon / Core CPUs. Quantized models (q5_0, q4_0) make small + base usable for batch jobs on a 4-core VM.

ggml-base.en at q5_0 · well under the ~140 MB of the unquantized file · runs well on a 2-vCPU cloud box.

Windows · Vulkan or MSVC

Cross-vendor GPU on consumer Windows

Two paths: build natively with MSVC + cmake for a CPU/AVX2 binary, or pass cmake -B build -DGGML_VULKAN=1 to get a single binary that targets NVIDIA, AMD, and Intel Arc GPUs through Vulkan. Vulkan is the path of least resistance when you don't want to ship a CUDA SDK to end users.

Vulkan backend is community-driven · check the issues tracker before targeting it for production.

Raspberry Pi · ARM NEON

Edge inference on Pi 4 / Pi 5

NEON SIMD is on by default for ARM64 builds. The tiny and base models run usefully on a Pi 4 (4 GB) or Pi 5; small is tractable with patience. The repo's bench example is the quickest way to read realtime-factor on your specific board before designing around it.

Pi 5 · ggml-base.en at q5_0 · roughly realtime for short clips.
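
The realtime-factor figures quoted in these cards can be reproduced with simple arithmetic. The sketch below assumes the usual definition (processing time divided by audio duration, so values below 1 are faster than realtime); the function name rtf is hypothetical and the timings are placeholders, not measurements.

```shell
#!/usr/bin/env bash
# rtf PROCESSING_SECONDS AUDIO_SECONDS
# Prints the realtime factor to two decimals.
# Assumed definition: processing time / audio duration (< 1.00 is faster than realtime).
rtf() {
  awk -v p="$1" -v a="$2" 'BEGIN {
    if (a <= 0) { print "audio length must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", p / a
  }'
}

# Placeholder numbers: an 11-second clip that took 5.5 s to transcribe.
rtf 5.5 11
```

Feed it the wall-clock time of a run on your own board and the clip length to compare models before committing to one.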

Browser · WebAssembly

In-page transcription · no server

The whisper.wasm example compiles the encoder + decoder to WebAssembly with SIMD enabled. Load the model file in-browser (or stream it from your own origin), call whisper_full() from JS, and you have a serverless transcription demo. The official live demo runs tiny + base in the browser today.

WASM + SIMD · tiny / base are the realistic ceiling for in-browser inference.

iOS + Android get first-class bindings via the whisper.swiftui and whisper.android example apps. For Python-runtime alternatives see faster-whisper (CTranslate2, faster on NVIDIA) and openai/whisper (the reference implementation whisper.cpp tracks).
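
The build flags named in the cards above can be collected in one place. This is a minimal sketch: the backend labels (metal, cuda, and so on) and the function names are invented for illustration, and the script only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# cmake_flags_for BACKEND
# Maps a backend label (labels are this sketch's own) to the cmake flag quoted in the cards.
cmake_flags_for() {
  case "$1" in
    metal|cpu) echo "" ;;                     # defaults: no extra flag needed
    coreml)    echo "-DWHISPER_COREML=1" ;;
    cuda)      echo "-DGGML_CUDA=1" ;;
    blas)      echo "-DGGML_BLAS=1" ;;
    openvino)  echo "-DWHISPER_OPENVINO=1" ;;
    vulkan)    echo "-DGGML_VULKAN=1" ;;
    *)         echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Print the configure + build commands for a backend without executing them.
print_build_commands() {
  local flags
  flags="$(cmake_flags_for "$1")" || return 1
  echo "cmake -B build${flags:+ $flags}"
  echo "cmake --build build -j --config Release"
}

print_build_commands vulkan
```

Printing rather than executing keeps the sketch safe to run anywhere; pipe the output to sh inside a whisper.cpp checkout once the commands look right.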

Setup recipes · pick one and copy

Three runnable configurations covering the most common whisper.cpp deployments. Commands verified against the current master branch and the Homebrew formula.

1. macOS · brew install + first transcript

Fastest path on a Mac. Homebrew ships a prebuilt whisper-cli; you only need to fetch a ggml model.

# macOS · Homebrew
brew install whisper-cpp

# fetch a ggml model from the official HF mirror
# (the helper script lives in the source tree)
git clone https://github.com/ggml-org/whisper.cpp
./whisper.cpp/models/download-ggml-model.sh base.en

# transcribe
whisper-cli \
  -m whisper.cpp/models/ggml-base.en.bin \
  -f audio.wav \
  -otxt -ovtt -osrt
# outputs: audio.wav.txt audio.wav.vtt audio.wav.srt

Brew formula: whisper-cpp. Audio must be 16 kHz mono WAV; convert with ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav.
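
For more than one file, the ffmpeg conversion can be wrapped in a small helper. The sketch below is an illustration only: the function names and the .16k.wav naming scheme are assumptions, and the explicit pcm_s16le codec is added to make the 16-bit PCM output unambiguous.

```shell
#!/usr/bin/env bash
# wav_name_for SRC
# Derives the output path: talk.mp3 -> talk.16k.wav (naming scheme is this sketch's choice).
wav_name_for() {
  printf '%s.16k.wav\n' "${1%.*}"
}

# to_wav16k SRC
# Converts one file to 16 kHz mono 16-bit PCM WAV and prints the output path.
to_wav16k() {
  local src="$1" dst
  dst="$(wav_name_for "$src")"
  command -v ffmpeg >/dev/null || { echo "ffmpeg not found" >&2; return 1; }
  ffmpeg -loglevel error -y -i "$src" -ar 16000 -ac 1 -c:a pcm_s16le "$dst" \
    && printf '%s\n' "$dst"
}

# Usage (not run here):
#   for f in *.mp3; do to_wav16k "$f"; done
```

Each printed path can be passed straight to whisper-cli with -f.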

2. Build from source · Metal (Mac) or CUDA (NVIDIA)

Use this when you want the latest commit, the Core ML encoder, or a CUDA build. The build is driven by CMake; there is no Make step.

# clone + cmake
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp

# Apple Silicon: Metal is on by default.
# Add Core ML (ANE encoder) for an extra 2-3x on small/base/medium.
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

# NVIDIA: build with CUDA instead.
# cmake -B build -DGGML_CUDA=1
# cmake --build build -j --config Release

# fetch + run
./models/download-ggml-model.sh large-v3-turbo
./build/bin/whisper-cli \
  -m models/ggml-large-v3-turbo.bin \
  -f samples/jfk.wav \
  -t 8 -p 1

Core ML encoder requires generating the .mlmodelc via models/generate-coreml-model.sh. For CUDA arch overrides see README · NVIDIA GPU support.

3. Quantize a model for edge / Pi

Quantizing to q5_0 shrinks large-v3 from ~3 GB to ~1.1 GB with minimal WER loss. The commands below apply it to medium; the same steps work for tiny/base/small on a Pi.

# from inside the built whisper.cpp tree
./models/download-ggml-model.sh medium

# quantize fp16 -> q5_0 (5-bit weights, roughly 3x smaller)
./build/bin/quantize \
  models/ggml-medium.bin \
  models/ggml-medium-q5_0.bin \
  q5_0

# run on CPU with 4 threads
./build/bin/whisper-cli \
  -m models/ggml-medium-q5_0.bin \
  -f audio.wav \
  -t 4 -p 1 -otxt

Supported quant types: q4_0, q4_1, q5_0, q5_1, q8_0. Pre-quantized .bin files also live on the HF model card if you'd rather not build the quantize binary.
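
The size reductions quoted above follow from the per-block storage cost of each quant type. The sketch below is a rough estimator, not upstream tooling: the function name is hypothetical, and the bytes-per-block values reflect my understanding of ggml's 32-weight block layouts (for example, q5_0 stores 22 bytes per block versus 64 bytes at fp16). Real files come out somewhat larger because not every tensor is quantized.

```shell
#!/usr/bin/env bash
# estimate_quant_mb FP16_MB QUANT_TYPE
# Rough lower-bound size estimate. Assumes ggml blocks of 32 weights:
# 64 bytes per block at fp16, and the per-type block sizes listed below.
estimate_quant_mb() {
  local fp16_mb="$1" block_bytes
  case "$2" in
    q4_0) block_bytes=18 ;;
    q4_1) block_bytes=20 ;;
    q5_0) block_bytes=22 ;;
    q5_1) block_bytes=24 ;;
    q8_0) block_bytes=34 ;;
    *)    echo "unknown quant type: $2" >&2; return 1 ;;
  esac
  echo $(( fp16_mb * block_bytes / 64 ))
}

# large-v3 at roughly 3000 MB in fp16
estimate_quant_mb 3000 q5_0
```

For large-v3 this gives 1031 MB at q5_0, in line with the ~1.1 GB figure once the unquantized tensors are counted.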

Features

Speaker diarization	No
Word-level timestamps	Yes
Streaming / real-time	Yes
Languages supported	99
HIPAA eligible	No

whisper.cpp vs Whipscribe

Feature	whisper.cpp	Whipscribe
Category	Open source	Transcription APIs
Pricing	free	free beta
Speaker diarization	No	Yes
Word timestamps	Yes	Yes
Streaming	Yes	No
Languages	99	99
Platforms	macOS, Linux, Windows, iOS, Android, Edge	Web, API, MCP

Alternatives to whisper.cpp

OpenAI Whisper

OpenAI

The reference open-source multilingual ASR model from OpenAI.

OSS · MIT ★ 98.1k

faster-whisper

SYSTRAN

Up to 4× faster than reference Whisper using CTranslate2; the production sweet spot.

OSS · MIT ★ 22.3k

whisperX

Max Bain

Faster-whisper + forced alignment + speaker diarization in one pipeline.

OSS · BSD‑2‑Clause ★ 21.4k

Frequently asked about whisper.cpp

Does whisper.cpp work on Apple Silicon?

Yes. whisper.cpp is one of the fastest Whisper options on M-series Macs thanks to its Metal backend. Metal is enabled by default when you build on Apple Silicon, so the model runs on the GPU without PyTorch or CUDA.

Do I need a GPU to use whisper.cpp?

No. whisper.cpp is CPU-first and runs on laptops, Raspberry Pis, and phones. On Apple Silicon it also uses Metal; on NVIDIA GPUs it can use CUDA/cuBLAS; on x86 CPUs it uses AVX/AVX2. A GPU helps but isn't required.

Does whisper.cpp support diarization?

Not out of the box. It outputs text with timestamps but no speaker labels. For speaker labels, run the audio through pyannote separately or use whisperX, which bundles faster-whisper with forced alignment and diarization.

How do I download the model files?

The repo includes a models/download-ggml-model.sh script. Pick a size (tiny/base/small/medium/large-v3) based on your RAM/CPU budget. Larger models = better accuracy, more memory.
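
Choosing a size by memory budget can be sketched as a simple lookup. The thresholds below are approximate fp16 memory figures as I recall them from the upstream README (about 273 MB for tiny up to about 3.9 GB for large); treat them as assumptions and check the README for current numbers. The function name pick_model is hypothetical.

```shell
#!/usr/bin/env bash
# pick_model RAM_BUDGET_MB
# Prints the largest model whose approximate memory need fits the budget.
# Thresholds are assumed, approximate fp16 figures; verify against the upstream README.
pick_model() {
  local budget="$1"
  if   [ "$budget" -ge 3900 ]; then echo "large-v3"
  elif [ "$budget" -ge 2100 ]; then echo "medium"
  elif [ "$budget" -ge 852 ];  then echo "small"
  elif [ "$budget" -ge 388 ];  then echo "base"
  elif [ "$budget" -ge 273 ];  then echo "tiny"
  else echo "budget too small for any model" >&2; return 1
  fi
}

# Print the download command for a 1 GB budget instead of running it.
model="$(pick_model 1024)"
echo "./models/download-ggml-model.sh $model"
```

Quantized variants need less memory than these figures, so a budget that only fits base at fp16 may fit small at q5_0.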

Does whisper.cpp support streaming?

Yes. The stream example in the repo shows live microphone transcription. Latency depends on model size and hardware.
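
A hedged sketch of how the stream example is typically invoked follows. The SDL2 build flag, the whisper-stream binary name, and the --step / --length options are taken from my reading of the upstream stream example and should be checked against its --help output; the helper function is hypothetical and only prints the commands.

```shell
#!/usr/bin/env bash
# stream_command MODEL [STEP_MS] [LENGTH_MS]
# Prints a whisper-stream invocation; flag names are assumed from the upstream example.
stream_command() {
  local model="$1" step="${2:-500}" length="${3:-5000}"
  echo "./build/bin/whisper-stream -m $model -t 8 --step $step --length $length"
}

# The stream example needs SDL2 for microphone capture (assumed build flag).
echo "cmake -B build -DWHISPER_SDL2=ON && cmake --build build -j --config Release"
stream_command models/ggml-base.en.bin
```

A shorter step lowers latency at the cost of more compute per second of audio, which matters most on CPU-only targets.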

Whipscribe is a managed faster-whisper + whisperX service for when you want transcripts without running infrastructure.