whisper.cpp
Pure-C/C++ port of Whisper. No PyTorch, no CUDA dependency, ggml-format quantized weights. Runs on Apple Silicon, NVIDIA, Linux CPU, Raspberry Pi, iOS, Android, and the browser via WASM.
Georgi Gerganov's dependency-free C/C++ reimplementation of Whisper, built on the same ggml tensor library that powers llama.cpp. Weights ship as ggml-format .bin files, either fp16 or quantized (q8_0, q5_1, q5_0, q4_0); a ~3 GB fp16 large-v3 shrinks to ~1.1 GB at q5_0 with negligible WER loss. Backends span Metal (default on Apple Silicon), Core ML via Apple Neural Engine, CUDA, Vulkan, OpenBLAS, ARM NEON, AVX/AVX2, and WebAssembly for in-browser inference.
Best for offline / on-device transcription where you do not want a Python runtime in the loop: Mac desktop apps, NVIDIA edge boxes, Pi-class ARM hardware, iOS/Android apps that ship the model inside the binary, and privacy-preserving browser demos. License: MIT. The upstream repo moved from ggerganov/whisper.cpp to ggml-org/whisper.cpp; the old URL redirects.
What it is
whisper.cpp is a dependency-free C/C++ port of Whisper. It needs neither PyTorch nor CUDA, runs on a plain CPU, and is one of the fastest Whisper options on Apple Silicon thanks to the Metal backend. It shares the ggml tensor library with llama.cpp. A good fit when you need privacy-preserving, offline transcription on consumer hardware.
Watch out for: no speaker diarization out of the box (speaker labels need an external tool such as pyannote); model management is manual.
Install / use
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp && cmake -B build && cmake --build build -j --config Release
Pick a runtime · 6 platforms
whisper.cpp's differentiating story is reach. The model file is the same ggml .bin everywhere — what changes is the backend you build against. Each card links the canonical example folder in the upstream repo so you can read the actual project files before committing to a stack.
Metal is enabled by default when you build on Apple Silicon, so no flag is needed. As an optional extra step, add Core ML so the encoder runs on the Apple Neural Engine, which lifts throughput on M1/M2/M3/M4 by roughly 2-3x on small/base/medium models. Build with cmake -B build -DWHISPER_COREML=1.
ggml-large-v3 + Core ML encoder on M2 Pro · realtime factor well under 1x for most podcast audio.
Build with cmake -B build -DGGML_CUDA=1 and the encoder + decoder run on cuBLAS kernels. Pair with q5_0 or q8_0 quantized weights to fit large-v3 on consumer 6-8 GB cards. The repo's CUDA Dockerfile is the simplest reproducible path for an Ubuntu box.
Production tier · materially slower than faster-whisper on the same GPU, but no Python in the runtime.
Default build is pthreads + AVX/AVX2; add cmake -B build -DGGML_BLAS=1 for an OpenBLAS speedup, or -DWHISPER_OPENVINO=1 for Intel's OpenVINO encoder path on Xeon / Core CPUs. Quantized models (q5_0, q4_0) make small + base usable for batch jobs on a 4-core VM.
ggml-base.en at q5_0 · ~140 MB on disk · runs well on a 2-vCPU cloud box.
Two paths on Windows: build natively with MSVC + cmake for a CPU/AVX2 binary, or pass cmake -B build -DGGML_VULKAN=1 to get a single binary that targets NVIDIA, AMD, and Intel Arc GPUs through Vulkan. Vulkan is the path of least resistance when you don't want to ship a CUDA SDK to end users.
Vulkan backend is community-driven · check the issues tracker before targeting it for production.
NEON SIMD is on by default for ARM64 builds. The tiny and base models run usefully on a Pi 4 (4 GB) or Pi 5; small is tractable with patience. The repo's bench example is the quickest way to read realtime-factor on your specific board before designing around it.
Pi 5 · ggml-base.en at q5_0 · roughly realtime for short clips.
The whisper.wasm example compiles the encoder + decoder to WebAssembly with SIMD enabled. Load the model file in-browser (or stream it from your own origin), call whisper_full() from JS, and you have a serverless transcription demo. The official live demo runs tiny + base in the browser today.
WASM + SIMD · tiny / base are the realistic ceiling for in-browser inference.
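The build flags quoted in the cards above can be gathered into one small helper. This is only a sketch: backend_flags and configure_cmd are hypothetical names (not part of whisper.cpp), and the flags are simply the ones listed on this page, so check the upstream README before relying on them.

```shell
#!/bin/sh
# Map a backend name to the cmake flag quoted on this page.
# Function names are illustrative, not part of whisper.cpp.
backend_flags() {
  case "$1" in
    metal)    echo "" ;;                      # default on Apple Silicon, no flag
    coreml)   echo "-DWHISPER_COREML=1" ;;
    cuda)     echo "-DGGML_CUDA=1" ;;
    blas)     echo "-DGGML_BLAS=1" ;;
    openvino) echo "-DWHISPER_OPENVINO=1" ;;
    vulkan)   echo "-DGGML_VULKAN=1" ;;
    *) echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Print the cmake configure command for a backend (does not run it).
configure_cmd() {
  flags=$(backend_flags "$1") || return 1
  echo "cmake -B build${flags:+ $flags}"
}

configure_cmd metal   # cmake -B build
configure_cmd cuda    # cmake -B build -DGGML_CUDA=1
```

Follow the printed configure line with cmake --build build -j --config Release, as in the source-build recipe.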
Setup recipes · pick one and copy
Three runnable configurations covering the most common whisper.cpp deployments. Commands verified against the current master branch and the Homebrew formula.
Fastest path on a Mac. Homebrew ships a prebuilt whisper-cli; you only need to fetch a ggml model.
# macOS · Homebrew
brew install whisper-cpp
# fetch a ggml model from the official HF mirror
# (the helper script lives in the source tree)
git clone https://github.com/ggml-org/whisper.cpp
./whisper.cpp/models/download-ggml-model.sh base.en
# transcribe
whisper-cli \
-m whisper.cpp/models/ggml-base.en.bin \
-f audio.wav \
-otxt -ovtt -osrt
# outputs: audio.wav.txt audio.wav.vtt audio.wav.srt
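For more than one file, the steps above can be wrapped in a small script. A minimal sketch, assuming ffmpeg and whisper-cli are on PATH and the model sits where the recipe downloaded it. The helper names (wav_path, run, transcribe_one) and the DRY_RUN variable are hypothetical conveniences, and the ffmpeg flags are the 16 kHz mono conversion used on this page.

```shell
#!/bin/sh
# Model path from the Homebrew recipe; override by exporting MODEL.
MODEL="${MODEL:-whisper.cpp/models/ggml-base.en.bin}"

# talk.mp3 -> talk.16k.wav (assumes the file name has an extension).
wav_path() { printf '%s\n' "${1%.*}.16k.wav"; }

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Convert one input to 16 kHz mono WAV, then transcribe it.
transcribe_one() {
  wav=$(wav_path "$1")
  run ffmpeg -y -i "$1" -ar 16000 -ac 1 "$wav" &&
  run whisper-cli -m "$MODEL" -f "$wav" -otxt -ovtt -osrt
}
```

With DRY_RUN=1 set, transcribe_one talk.mp3 prints the two commands instead of running them; a loop such as for f in *.mp3; do transcribe_one "$f"; done would process a folder.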
To convert other audio to 16 kHz mono WAV first: ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav
Use this when you want the latest commit, the Core ML encoder, or a CUDA build. Driven by cmake, no Make.
# clone + cmake
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
# Apple Silicon: Metal is on by default.
# Add Core ML (ANE encoder) for an extra 2-3x on small/base/medium.
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
# NVIDIA: build with CUDA instead.
# cmake -B build -DGGML_CUDA=1
# cmake --build build -j --config Release
# fetch + run
./models/download-ggml-model.sh large-v3-turbo
./build/bin/whisper-cli \
-m models/ggml-large-v3-turbo.bin \
-f samples/jfk.wav \
-t 8 -p 1
The Core ML build also needs a compiled encoder (.mlmodelc), generated via models/generate-coreml-model.sh. For CUDA arch overrides see README · NVIDIA GPU support.
Shrink large-v3 from ~3 GB to ~1.1 GB (q5_0) with minimal WER loss. Same trick for tiny/base/small on a Pi.
# from inside the built whisper.cpp tree
./models/download-ggml-model.sh medium
# quantize fp16 -> q5_0 (5-bit weights plus per-block scales, ~2.9x smaller)
./build/bin/quantize \
models/ggml-medium.bin \
models/ggml-medium-q5_0.bin \
q5_0
# run on CPU with 4 threads
./build/bin/whisper-cli \
-m models/ggml-medium-q5_0.bin \
-f audio.wav \
-t 4 -p 1 -otxt
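The size reduction in this recipe can be estimated from bits per weight. A rough sketch under stated assumptions: the parameter counts (769M for medium, 1550M for large) are the commonly published Whisper figures, the bits-per-weight values assume ggml's 32-weight block layouts (packed weights plus a per-block scale), and bits_per_weight and estimate_mb are hypothetical helper names. Real files typically come out slightly larger because not every tensor is quantized.

```shell
#!/bin/sh
# Approximate bits per weight for each ggml type (assumed block layouts).
bits_per_weight() {
  case "$1" in
    f16)  echo 16 ;;
    q8_0) echo 8.5 ;;
    q5_1) echo 6 ;;
    q5_0) echo 5.5 ;;
    q4_1) echo 5 ;;
    q4_0) echo 4.5 ;;
    *) echo "unknown type: $1" >&2; return 1 ;;
  esac
}

# $1 = parameters in millions, $2 = type; prints estimated size in MB.
estimate_mb() {
  bpw=$(bits_per_weight "$2") || return 1
  awk -v p="$1" -v b="$bpw" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

estimate_mb 769 f16    # 1538
estimate_mb 769 q5_0   # 529
estimate_mb 1550 f16   # 3100
estimate_mb 1550 q5_0  # 1066
```

The fp16 to q5_0 ratio under these assumptions is 16 / 5.5, or about 2.9x.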
Quantization types: q4_0, q4_1, q5_0, q5_1, q8_0. Pre-quantized .bin files also live on the HF model card if you'd rather not build the quantize binary.
Features
| Feature | whisper.cpp |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | Yes |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- ggml-org/whisper.cpp: main repo · README has the build matrix, backend flags, and benchmark tables
- whisper.cpp · examples/: canonical project folders: cli, server, stream, wasm, swiftui, android, objc, talk-llama, vad-speech-segments
- ggml-org/ggml: the underlying tensor library · same project that backs llama.cpp
- huggingface.co/ggerganov/whisper.cpp: canonical ggml model card · tiny through large-v3-turbo, fp16 and quantized variants
- examples/whisper.swiftui: reference iOS / macOS app · drop-in starting point for shipping whisper.cpp inside a Swift binary
- examples/whisper.android: reference Android app · JNI bindings + Gradle project
- examples/whisper.wasm: in-browser inference via Emscripten · the project also hosts a live WASM demo linked from the repo README
- whisper.cpp · Discussions: the practical Q&A surface for backend / build / model issues; read here before opening an issue
- formulae.brew.sh · whisper-cpp: Homebrew formula · ships the whisper-cli binary; ggml model files still need to be fetched separately
whisper.cpp vs Whipscribe
| Feature | whisper.cpp | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | Yes | No |
| Languages | 99 | 99 |
| Platforms | macOS, Linux, Windows, iOS, Android, Edge | Web, API, MCP |
Frequently asked about whisper.cpp
Does whisper.cpp work on Apple Silicon?
Yes. whisper.cpp is one of the fastest Whisper options on M-series Macs thanks to its Metal backend. Metal is enabled by default when you build on Apple Silicon, so the model runs on the GPU without PyTorch or CUDA.
Do I need a GPU to use whisper.cpp?
No. whisper.cpp is CPU-first and runs on laptops, Raspberry Pis, and phones. On Apple Silicon it also uses Metal; on NVIDIA it can use CUDA (cuBLAS); on x86 it uses AVX/AVX2. A GPU helps but isn't required.
Does whisper.cpp support diarization?
Not out of the box. It outputs text with timestamps but no speaker labels. For speaker labels, run the audio through pyannote separately or use whisperX, which bundles pyannote diarization on top of a faster-whisper backend.
How do I download the model files?
The repo includes a models/download-ggml-model.sh script. Pick a size (tiny/base/small/medium/large-v3) based on your RAM/CPU budget. Larger models = better accuracy, more memory.
Does whisper.cpp support streaming?
Yes. The stream example in the repo shows live microphone transcription. Latency depends on model size and hardware.
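As a rough sketch of what running the stream example looks like: the helper below only prints a command line and does not run it. It assumes the example is built as build/bin/whisper-stream with SDL2 enabled for microphone capture, and that --step and --length (both in milliseconds) set how often audio is processed and how long a window is kept. stream_cmd is a hypothetical helper name; verify the binary name and flags against the example's own README.

```shell
#!/bin/sh
# Assumed build step (from the repo root); SDL2 provides microphone capture:
#   cmake -B build -DWHISPER_SDL2=ON && cmake --build build -j --config Release

# $1 = model path, $2 = step in ms, $3 = window length in ms.
# Prints the command instead of running it.
stream_cmd() {
  printf './build/bin/whisper-stream -m %s --step %s --length %s\n' "$1" "$2" "$3"
}

stream_cmd models/ggml-base.en.bin 500 5000
```

A shorter step lowers latency but means more frequent inference, so smaller models are the safer starting point on modest hardware.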
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.