WhisperKit
Argmax's MIT-licensed Swift SDK that runs Whisper natively on Apple Silicon. CoreML-quantized weights schedule across the Apple Neural Engine, GPU, and CPU automatically — no PyTorch, no CUDA, no server hop.
WhisperKit is the speech-to-text core of the Argmax Open-Source SDK (v1.0.0, May 2026) — a Swift Package that ships pre-converted CoreML Whisper weights and an async API designed for Apple platforms. The OS schedules layers across the Apple Neural Engine, GPU (Metal), and CPU per call, so you get hardware acceleration that whisper.cpp only reaches with extra build flags and a separate encoder-conversion step. The same SPM dependency also exposes sibling kits SpeakerKit (pyannote diarization) and TTSKit (Qwen3-TTS).
Best for Apple-app developers shipping on-device voice features — dictation, accessibility captions, in-game voice, offline note-takers, HIPAA-sensitive transcription — where a Python runtime or a network round-trip is a non-starter. Supported targets: macOS 14+ and iOS 17+ per the v1.0.0 README; sister kits TTSKit (macOS 15 / iOS 18) and SpeakerKit (macOS 13 / iOS 16) have their own floors. License: MIT. Repo: argmaxinc/argmax-oss-swift (the legacy argmaxinc/WhisperKit URL redirects).
What it is
WhisperKit is Argmax's Swift-native Whisper runtime for Apple Silicon. CoreML-compiled encoder + decoder run on the Neural Engine, GPU, and CPU automatically — no PyTorch, no Python, no CUDA, and no manual conversion step. As of v1.0.0 (2026-05-01) the repo was renamed from argmaxinc/WhisperKit to argmaxinc/argmax-oss-swift and now ships WhisperKit + SpeakerKit (pyannote diarization) + TTSKit (Qwen3-TTS) in a single MIT-licensed Swift package. The whisperkit-cli (Homebrew + `swift run`), an OpenAI-compatible local server, and 27+ pre-converted model variants on HuggingFace make it the default choice for any developer who wants Whisper-grade transcription on Apple platforms without running a backend.
Watch out for: Apple ecosystem only; diarization now sits in a sibling kit (SpeakerKit) rather than in WhisperKit itself; word-level forced alignment is not in the open-source surface.
Install / use
View on GitHub (Argmax Open-Source SDK) · github.com
Add via Swift Package Manager in Xcode: File → Add Package Dependencies…
Pick your Apple target · 5 surfaces
WhisperKit targets the full Apple stack from one Swift package. The integration story differs by surface — what changes is the example app you start from, the practical model size, and which compute units the OS picks. Each card links the canonical project folder or doc page in the upstream repo.
Add the Swift Package, import WhisperKit, and call transcribe from an actor or task. The CoreML scheduler routes the heavy attention layers to the Apple Neural Engine, so a SwiftUI dictation app stays responsive on an A17 Pro or M-series iPad. Practical model ceiling tracks RAM — tiny and base run on older iPhones, quantized large-v3 variants (547-626MB) target iPhone 15 Pro and newer with 8GB.
Pin the model with WhisperKitConfig(model: "large-v3-v20240930_626MB") to ship a deterministic asset bundle instead of relying on lazy first-run download.
WhisperKit is the SDK behind the modern crop of Mac dictation apps (Superwhisper, Wispr-class tools). Async transcribe + system-wide hotkey + a small CoreML model gives type-as-you-speak latency on any M-series Mac. Argmax does not ship a flagship end-user app — third parties do — so building a competitive Mac voice utility is a tractable weekend project on top of WhisperKit.
M2 Ultra hits roughly 42x realtime on large-v3-turbo (ANE only) per the upstream benchmarks; M-series laptops still clear realtime by a wide margin.
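To turn a realtime factor into a wall-clock estimate, divide the audio duration by the factor. The sketch below assumes the roughly 42x figure quoted above holds for your file and hardware. `estimate_seconds` is a name chosen here for illustration and is not part of any WhisperKit tooling.

```python
def estimate_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 60-minute recording at roughly 42x realtime
print(round(estimate_seconds(60 * 60, 42), 1))  # 85.7
```

Treat the result as a lower bound: model load time and first-run download are not part of the realtime factor.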
Vision Pro has the RAM and the M-series silicon to run the full large-v3 comfortably, which makes WhisperKit the natural choice for on-device live captioning, immersive meeting tools, and accessibility overlays inside visionOS apps. Build target: same Swift package, no separate framework. v1.0.0 ships TTSKit alongside, so a captions-plus-readback pipeline lives in one dependency.
visionOS is a first-class target in the umbrella ArgmaxOSS product but verify your minimum deployment against the active v1.x README before shipping to TestFlight.
Treat watchOS as a feasibility target, not a primary one. The Watch's tighter memory budget means tiny and base are the only realistic Whisper variants, and on-device transcription is usually a fallback path while audio also offloads to the paired iPhone. The legacy WhisperKit README listed watchOS 10+ in its platform matrix; the v1.0.0 README leads with macOS + iOS, so test against the current Package.swift before committing.
Open-source surface tracks the package's stated minimums — watch the Examples folder for a watchOS sample app before promising watch-native transcription in a launch.
Two flavors. Run whisperkit-cli transcribe from the shell for batch jobs, or run whisperkit-cli serve to spin up an OpenAI-compatible local server (POST /v1/audio/transcriptions, /v1/audio/translations, SSE streaming). The serve mode is the easiest way to drop WhisperKit into an existing Python / Node app on a Mac without writing any Swift — point the openai SDK's base_url at localhost.
Streaming microphone is exposed via the --stream flag on transcribe; partial results emit as the audio comes in.
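For the SSE streaming mentioned above, a client has to split the response into events before it can read partial text. The following is a minimal sketch of that framing step using only the standard library. The `data:` line framing comes from the Server-Sent Events format; the JSON payload with a `text` field and the `[DONE]` sentinel are assumptions made for illustration, so check the actual server output before relying on them.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comment lines are ignored here.
    if buffer:  # tolerate a stream that ends without a trailing blank line
        yield "\n".join(buffer)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the assumed "text" field of each JSON event."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        parts.append(json.loads(payload).get("text", ""))
    return "".join(parts)


# Made-up sample stream, not captured from a real server
sample = [
    'data: {"text": "Hello"}\n', "\n",
    'data: {"text": " world"}\n', "\n",
    "data: [DONE]\n", "\n",
]
print(collect_text(sample))  # Hello world
```

In a real client the `lines` iterable would come from a streaming HTTP response rather than a list.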
ArgmaxOSS target. For non-Apple targets see whisper.cpp (Linux / Windows / Android / WASM) and faster-whisper (CUDA server-class GPUs).
Setup recipes · pick your platform
Three recipes covering the most common WhisperKit integrations. Verified against v1.0.0 of the Argmax Open-Source SDK and the current README.
Xcode File → Add Package Dependencies, paste the URL, depend on the WhisperKit product. Five-line transcribe.
// Package.swift
dependencies: [
    .package(
        url: "https://github.com/argmaxinc/argmax-oss-swift.git",
        from: "1.0.0"
    ),
],
.target(
    name: "YourApp",
    dependencies: [
        // Just the STT kit:
        .product(name: "WhisperKit", package: "argmax-oss-swift"),
        // Or import the full SDK (WhisperKit + SpeakerKit + TTSKit):
        // .product(name: "ArgmaxOSS", package: "argmax-oss-swift"),
    ]
)

// Anywhere in your code:
import WhisperKit

Task {
    let pipe = try await WhisperKit()
    let result = try await pipe.transcribe(
        audioPath: "path/to/audio.m4a"
    )
    print(result?.text ?? "")
}
WhisperKit(WhisperKitConfig(model: "large-v3-v20240930_626MB")). Full model list: huggingface.co/argmaxinc/whisperkit-coreml.
Start from the in-repo WhisperAX sample app, which wires AVAudioEngine into WhisperKit with partial-result emission.
// Sketch · clone of WhisperAX's streaming loop
import WhisperKit
import AVFoundation

@MainActor
final class Dictation: ObservableObject {
    @Published var partial: String = ""
    private var pipe: WhisperKit?
    private let engine = AVAudioEngine()

    func start() async throws {
        pipe = try await WhisperKit(
            WhisperKitConfig(model: "base.en")
        )
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 4096, format: format) { buf, _ in
            Task { [weak self] in
                // Feed 16kHz PCM chunks into WhisperKit's transcribe loop.
                // WhisperAX uses an audio buffer + VAD ring; mirror that here.
                guard let pipe = self?.pipe else { return }
                // toFloatArray() stands in for your own AVAudioPCMBuffer -> [Float]
                // conversion (resampled to 16kHz mono); it is not an AVFoundation API.
                let chunk = buf.toFloatArray()
                if let r = try? await pipe.transcribe(audioArray: chunk) {
                    self?.partial = r.text
                }
            }
        }
        try engine.start()
    }
}
AudioStreamTranscriber class. Production streaming with sub-200ms guarantees lives in Argmax Pro.
Homebrew is the fast path; swift run is the build-from-source path.
# 1. Install via Homebrew (macOS)
brew install whisperkit-cli
whisperkit-cli transcribe \
    --model "large-v3" \
    --audio-path input.wav

# 2. Or build from source
git clone https://github.com/argmaxinc/argmax-oss-swift.git
cd argmax-oss-swift
make setup
make download-model MODEL=large-v3-v20240930_626MB
swift run whisperkit-cli transcribe \
    --model-path "Models/whisperkit-coreml/openai_whisper-large-v3-v20240930_626MB" \
    --audio-path input.wav

# 3. OpenAI-compatible local server
swift run whisperkit-cli serve --model tiny --port 50060
# then call it from any OpenAI SDK with base_url=http://localhost:50060/v1
--stream to whisperkit-cli transcribe. For air-gapped builds run make download-model ahead of time and ship the .mlmodelc bundles. Models live at huggingface.co/argmaxinc/whisperkit-coreml.
What it really is
WhisperKit is an open-source Swift package from Argmax Inc. that runs OpenAI Whisper speech-to-text models entirely on Apple Silicon devices using CoreML. It exists because the reference Whisper code from OpenAI is Python+PyTorch and whisper.cpp — the most popular C++ port — treats the Apple Neural Engine as an opt-in extra rather than the primary execution path. WhisperKit compiles each Whisper variant into .mlmodelc bundles that the OS schedules across the Apple Neural Engine (ANE), GPU (Metal), and CPU automatically, so a single import gets idiomatic Swift-async transcription with hardware acceleration that whisper.cpp requires extra build flags and converted models to match.
The project was open-sourced under the MIT license in January 2024. On 2026-05-01 it graduated to v1.0.0 and was renamed the Argmax Open-Source SDK (repo argmaxinc/argmax-oss-swift), bundling three turn-key kits in one Swift package: WhisperKit (speech-to-text, Whisper), SpeakerKit (diarization, pyannote), and TTSKit (text-to-speech, Qwen3-TTS). The release adopts Swift 6 strict concurrency and vendors swift-transformers internally so consumer projects no longer pull HuggingFace's Hub library transitively.
Argmax distributes pre-converted CoreML weights for the entire Whisper family on HuggingFace at argmaxinc/whisperkit-coreml — tiny, base, small, medium, large-v2, large-v3, the September 2024 large-v3-v20240930 (better Spanish/Hindi/Korean), Distil-Whisper, plus quantized 'turbo' variants in the 547-955MB range that cut model size in half with minimal WER regression. Models download lazily on first use; whisperkit-cli ships via Homebrew (`brew install whisperkit-cli`) for command-line transcription, and a built-in OpenAI-compatible local server (POST /v1/audio/transcriptions) lets non-Swift apps call WhisperKit through the standard OpenAI SDK. Argmax also publishes a closed-source Pro SDK that adds real-time speaker-attributed transcription, custom vocabulary up to 3,000 terms, an Android Kotlin port, and a WebSocket streaming server compatible with Deepgram. The open-source package targets macOS 14+ and Xcode 16+; Apple Silicon (M1 or later, A14+ on iOS) is required for ANE acceleration. License: MIT.
Key specs
Performance (cited)
Get started — code
// Package.swift
dependencies: [
    .package(url: "https://github.com/argmaxinc/argmax-oss-swift.git", from: "1.0.0"),
],
.target(
    name: "YourApp",
    dependencies: [
        .product(name: "WhisperKit", package: "argmax-oss-swift"),
        // Or .product(name: "ArgmaxOSS", ...) for WhisperKit + SpeakerKit + TTSKit
    ]
)

import WhisperKit

Task {
    let pipe = try await WhisperKit()
    let result = try await pipe.transcribe(audioPath: "path/to/audio.m4a")
    print(result?.text ?? "")
}

// Pin a specific model:
let pipe = try await WhisperKit(WhisperKitConfig(
    model: "large-v3-v20240930_626MB"
))
# Install via Homebrew
brew install whisperkit-cli

# Or build from source
git clone https://github.com/argmaxinc/argmax-oss-swift.git
cd argmax-oss-swift
make setup
make download-model MODEL=large-v3-v20240930_626MB
swift run whisperkit-cli transcribe \
    --model-path "Models/whisperkit-coreml/openai_whisper-large-v3-v20240930_626MB" \
    --audio-path audio.m4a

# Mic streaming
swift run whisperkit-cli transcribe --model-path ... --stream

# Start the WhisperKit server
swift run whisperkit-cli serve --model tiny --port 50060

# Call it with the standard OpenAI SDK:
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:50060/v1", api_key="unused")
resp = client.audio.transcriptions.create(
    file=open("audio.wav", "rb"),
    model="tiny",
)
print(resp.text)
PY
How it compares
vs whisper.cpp
whisper.cpp ships as portable C/C++ with a `WHISPER_COREML=1` build flag plus a separate `generate-coreml-model.py` step that converts the encoder only — the decoder still runs in ggml on CPU/Metal. WhisperKit ships pre-converted .mlmodelc bundles for both encoder and decoder, so the Apple Neural Engine handles the heavy attention layers without bridging headers, callback APIs, or manual memory management. On the M3 ANE, Argmax measured a 45% latency reduction (8.4ms → 4.6ms per decoder forward pass) versus a pre-CoreML baseline. Bottom line: whisper.cpp is the right answer for Linux servers and Intel Macs; WhisperKit is the right answer the moment you target Apple Silicon and want native Swift idioms.
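The 45% figure follows directly from the two per-pass latencies quoted above. A quick check of the arithmetic, with `percent_reduction` as an illustrative helper name:

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the starting latency."""
    return (before_ms - after_ms) / before_ms * 100


print(round(percent_reduction(8.4, 4.6)))  # 45
print(round(8.4 / 4.6, 2))                 # 1.83, the same change expressed as a speedup
```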
vs whisperX
whisperX is a Python project that combines faster-whisper, wav2vec2 forced alignment, and pyannote diarization to produce word-timestamped, speaker-labeled transcripts on CUDA. WhisperKit's open-source surface is transcription only; diarization is now its sibling kit SpeakerKit (also pyannote, in the same Swift package as of v1.0.0); word-level forced alignment is not in the OSS package. To approximate whisperX behavior on a Mac, compose WhisperKit + SpeakerKit and use Whisper segment-level timestamps; for word-level alignment plus real-time speaker labels, Argmax Pro is the supported path.
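Composing segment-level timestamps with diarization output usually comes down to assigning each transcript segment the speaker whose turn overlaps it most. The sketch below shows that merge step on plain tuples. The tuple shapes and the sample data are hypothetical and do not mirror the actual WhisperKit or SpeakerKit result types; map your own results into them first.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text), hypothetical shape
Turn = Tuple[float, float, str]     # (start_s, end_s, speaker), hypothetical shape


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Attach to each segment the speaker with the largest time overlap."""
    labeled = []
    for start, end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))  # None if no turn overlaps
    return labeled


# Made-up example data
segments = [(0.0, 2.5, "Hi there."), (2.5, 6.0, "Hello, how are you?")]
turns = [(0.0, 2.4, "A"), (2.4, 6.0, "B")]
for speaker, text in label_segments(segments, turns):
    print(speaker, text)
```

Because the labels are per segment, a segment that spans a speaker change gets a single speaker; that is the gap word-level alignment closes.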
vs faster-whisper
faster-whisper is a CTranslate2-based Whisper runtime — outstanding on NVIDIA GPUs and very strong on x86 CPU, but on Apple Silicon it cannot use the Neural Engine and lands on CPU. A Swift or Mac developer picking faster-whisper has to bundle Python (or use the C++ ctranslate2 lib through a custom binding), download non-CoreML weights, and lose ANE acceleration. WhisperKit's CoreML stack uses ANE + GPU + CPU automatically, integrates with Swift async/await, and ships through SPM. faster-whisper remains the right pick for Linux/CUDA servers; WhisperKit is the right pick on every Apple platform.
vs MacWhisper
MacWhisper is an end-user Mac transcription app built by Jordi Bruin on top of whisper.cpp; WhisperKit is the SDK other apps embed. Argmax does not publish a flagship end-user app — third-party apps like Superwhisper (App Store ID 6471464415) are the most prominent products in the WhisperKit ecosystem. If you want an app, use MacWhisper or Superwhisper; if you want to build the next one, use WhisperKit.
Who picks this
Every link in one place
Features
| Feature | WhisperKit |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | Yes |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- argmaxinc/argmax-oss-swift: current repo · the legacy argmaxinc/WhisperKit URL redirects here
- v1.0.0 release · Argmax OSS SDK: rename + repackaging into WhisperKit + SpeakerKit + TTSKit (May 2026)
- Examples/WhisperAX: in-repo iOS + macOS sample app; start here for SwiftUI integration and streaming patterns
- Examples/ServeCLIClient: reference client for the OpenAI-compatible local server exposed by whisperkit-cli serve
- huggingface.co/argmaxinc/whisperkit-coreml: CoreML model collection · tiny through large-v3 + distil + turbo + quantized variants; ~10.9M downloads/month
- argmaxinc/whisperkittools: Python tooling for converting and benchmarking new Whisper checkpoints into CoreML
- WhisperKit benchmarks · HF Space: device-by-model latency + WER tables Argmax publishes alongside each release
- argmaxinc.com/blog: release notes and engineering posts on CoreML scheduling, ANE memory layout, and Pro SDK announcements
- argmax-oss-swift · Discussions: primary Q&A surface for integration help, model selection, and device-specific performance threads
- argmaxinc.com: company site · Pro SDK (custom vocabulary, real-time speaker-attributed streaming, Android port) lives behind a 14-day trial here
WhisperKit vs Whipscribe
| Feature | WhisperKit | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | Yes | No |
| Languages | 99 | 99 |
| Platforms | macOS, iOS, iPadOS, watchOS, visionOS | Web, API, MCP |
Alternatives to WhisperKit
Frequently asked about WhisperKit
Is WhisperKit the same as whisper.cpp on Mac?
No. whisper.cpp is a portable C/C++ Whisper port that runs Whisper on CPU with optional Metal GPU and an opt-in CoreML encoder; WhisperKit is a Swift-native package that compiles the full encoder and decoder to CoreML and lets the OS schedule layers across the Apple Neural Engine, GPU, and CPU automatically. WhisperKit is the right pick if you are shipping a Swift/SwiftUI app on Apple Silicon; whisper.cpp is the right pick when you need to run Whisper on Linux servers, Windows, Intel Macs, or embedded targets with no Apple framework available.
Does WhisperKit use the Apple Neural Engine (ANE)?
Yes. The CoreML model bundles published at huggingface.co/argmaxinc/whisperkit-coreml are compiled to run on ANE plus GPU plus CPU, and WhisperKit picks the compute units automatically. You can also pin them — e.g. `cpuAndNeuralEngine` to force ANE, `cpuAndGPU` to force Metal — via WhisperKitConfig.
How does WhisperKit compare to faster-whisper on Apple Silicon?
faster-whisper is a Python wrapper around CTranslate2 that targets CUDA and CPU; on Apple Silicon it falls back to CPU, so it does not use the Neural Engine and trails WhisperKit on Mac and iPhone. If you control the box and have an NVIDIA GPU, faster-whisper is excellent; if you ship a Mac or iOS app and want hardware acceleration without bundling Python or a CUDA runtime, WhisperKit wins by construction.
What's the difference between WhisperKit (open source) and Argmax Pro?
WhisperKit and the rest of the Argmax Open-Source SDK are MIT-licensed and ship the OpenAI Whisper, pyannote, and Qwen3-TTS models. Argmax Pro SDK is a closed-source extension with: real-time streaming transcription with live speaker attribution, custom-vocabulary support up to 3,000 keywords for domain accuracy, an Android/Kotlin port, a Deepgram-compatible WebSocket Local Server, and the Pro model variants (whisperkit-pro, parakeetkit-pro, speakerkit-pro). Pricing is on Argmax's site behind a 14-day trial.
Does WhisperKit support real-time / streaming transcription?
The open-source SDK supports microphone streaming via the CLI's `--stream` flag and partial-result streaming over Server-Sent Events from the local server, so you can build dictation-style apps. Real-time streaming with live diarization and latency guarantees is a Pro SDK feature; the open-source path streams transcripts as they're generated but does not guarantee sub-200ms first-token latency.
Where do I get the CoreML model files?
All variants are hosted at huggingface.co/argmaxinc/whisperkit-coreml. WhisperKit downloads the recommended model on first run; you can override with WhisperKitConfig(model:) using a glob like `large-v3-v20240930_626MB`. For air-gapped builds, run `make download-model MODEL=...` (or `make download-models` for the full set) and ship the resulting .mlmodelc bundles inside your app.
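To see why a full variant name is the safer thing to pin, it helps to look at how a glob fragment selects among model folders. The sketch below uses Python's `fnmatch` as a stand-in; WhisperKit's own matching logic may differ. The folder `openai_whisper-large-v3-v20240930_626MB` is the one used in the build-from-source recipe, and the other folder names are illustrative, formed by analogy.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def match_models(pattern: str, folders: Iterable[str]) -> List[str]:
    """Return folder names that contain the pattern, treating it as a glob fragment."""
    return [name for name in folders if fnmatch(name, f"*{pattern}*")]


folders = [
    "openai_whisper-tiny",                      # illustrative name
    "openai_whisper-large-v3",                  # illustrative name
    "openai_whisper-large-v3-v20240930_626MB",  # from the recipe above
]

# A short fragment is ambiguous: it matches two folders.
print(match_models("large-v3", folders))
# The full variant name matches exactly one.
print(match_models("large-v3-v20240930_626MB", folders))
```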
Is WhisperKit the same as whisperX?
No. whisperX is a Python project layering forced alignment (wav2vec2) and pyannote diarization on top of faster-whisper, primarily on CUDA. WhisperKit is a Swift CoreML inference framework; in the Argmax SDK 1.0.0 release, diarization is now a sibling kit (SpeakerKit, also pyannote-based) you can compose with WhisperKit, but word-level alignment is not part of the open-source surface. Visitors looking for whisperX behavior on Mac usually combine WhisperKit + SpeakerKit, or use Argmax Pro.
Does WhisperKit work on iPhone, iPad, Apple Watch, Vision Pro?
Yes. The package targets iOS, iPadOS, watchOS, and visionOS. Practical model size is the constraint: tiny and base run on Apple Watch and older iPhones; large-v3 quantized variants (547-626MB) target iPhone 15 Pro and newer with 8GB RAM. Vision Pro and M-series iPads run the full large-v3 comfortably.
What models should I use for production?
Argmax recommends `large-v3-v20240930_626MB` for maximum multilingual accuracy and `tiny` for fast iteration. The September 2024 v3 checkpoint is OpenAI's last Whisper update and noticeably better than 2023 large-v3 on Spanish, Hindi, and Korean. The `_turbo` suffix variants drop the heavy decoder for a lighter one with negligible WER regression on English; pick `_turbo_600MB` if real-time is the priority and `_626MB` non-turbo if WER is.
I searched 'whisperx on mac' — what should I use?
On Mac, WhisperKit + SpeakerKit covers the diarization half of whisperX with hardware acceleration the Python whisperX stack can't reach. You lose word-level forced alignment in the open-source path; if you need it, either run whisperX in a Linux Docker container or move to Argmax Pro.
Does WhisperKit support faster-whisper-style Apple Silicon Metal acceleration?
WhisperKit goes further than Metal: it uses CoreML, which schedules across ANE + GPU + CPU based on layer cost. faster-whisper has no Metal backend at all on Apple Silicon — it is CPU-only there. If your search was 'faster-whisper apple silicon metal support', WhisperKit is the answer for that intent.
Is WhisperKit free to use?
Yes — WhisperKit and the rest of the Argmax Open-Source SDK are MIT-licensed and free for commercial use. Argmax also publishes a closed-source Pro SDK with custom-vocabulary, real-time speaker-attributed streaming, and an Android port; pricing is on argmaxinc.com.
Does WhisperKit run on iOS?
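Yes. The v1.0.0 README lists macOS 14+ and iOS 17+. watchOS 10+ appeared in the legacy WhisperKit README's platform matrix and visionOS is a target of the umbrella ArgmaxOSS product, so verify both against the current Package.swift. Inference is CoreML-accelerated on Apple Silicon and happens fully on-device; no network round-trip is required.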
Does it work on Intel Macs?
It installs (Swift package, no architecture lock) but the CoreML weights are tuned for Apple Silicon. Intel Macs have no Neural Engine, so compute falls back to CPU + GPU and performance is similar to whisper.cpp's CPU mode.
Whipscribe is a managed faster-whisper + whisperX service for transcripts without running your own infrastructure.