WhisperKit vs Whipscribe (2026): a Swift framework for devs vs a hosted product for everyone

May 8, 2026 · Neugence · 12 min read

These two names get listed side-by-side a lot, and that comparison is misleading. WhisperKit — renamed to argmaxinc/argmax-oss-swift at v1.0.0 in May 2026 — is a Swift package you import into your iOS or Mac app. Whipscribe is a hosted product you (or your AI agent) call. The first is for Apple-platform developers building a feature; the second is for anyone with audio and a deadline. Below is the honest decision matrix and the worked examples for when each one is the right tool.

The two-second decision

Use WhisperKit if

You're a Swift developer shipping a feature

  • You're building an iOS, macOS, watchOS, or visionOS app and want voice transcription inside it.
  • Your audio must stay on the device — patient notes, legal recordings, internal memos.
  • You want offline operation by default and you're willing to ship model weights with your app.
  • Your users are on Apple Silicon and you don't need Windows / Android / browser parity.

Use Whipscribe if

You have audio and need a transcript

  • You're a podcaster, journalist, researcher, founder, lawyer, student, or operator with files and URLs.
  • You want to paste a YouTube link and get back a clean transcript with speaker labels.
  • You're an AI agent calling an MCP server to fetch transcripts inside a Claude / GPT workflow.
  • You need browser access from any laptop or phone, regardless of OS.

If you're not writing Swift this week, the rest of this post mostly serves as context — Whipscribe is the answer for you, full stop. If you are writing Swift and the question is "which one do I integrate?", read on.

What WhisperKit actually is

WhisperKit is an open-source Swift package from Argmax that runs OpenAI's Whisper models on Apple devices, accelerated by Apple's CoreML framework and the Apple Neural Engine. It's MIT-licensed, distributed via the Swift Package Manager, and runs on iPhone, iPad, Mac, Apple Watch, and Vision Pro.

In Argmax's own benchmarks, an optimized Large-v3 Turbo model on the Apple Neural Engine hits roughly 2.2% word error rate on standard English benchmarks at sub-200 ms first-word latency, with batch transcription running at 15–30× real-time on Apple Silicon Macs. On iPhone 13 and newer, Large-v3 Turbo is the typical production choice — near-Large accuracy, ~1.6 GB on disk, comfortable real-time streaming.

Naming note (May 2026). The package was renamed from argmaxinc/WhisperKit to argmaxinc/argmax-oss-swift at v1.0.0. The Swift module is still called WhisperKit, and the new SDK ships SpeakerKit (diarization) and TTSKit (text-to-speech) under the same umbrella. Most articles you'll find written before May 2026 still call it WhisperKit; both names point at the same code.

Adding it to a project is a one-liner:

// Package.swift: add to your dependencies array
.package(url: "https://github.com/argmaxinc/argmax-oss-swift", from: "1.0.0")

// in your view-model
import WhisperKit

let pipe = try await WhisperKit()                           // downloads/loads the default model
let result = try await pipe.transcribe(audioPath: url.path) // url is a local audio file URL
print(result?.text ?? "")

That's the entry-point. Real apps build a lot on top of those four lines.

What WhisperKit gives developers

What you build on top of it

This is the part that often gets glossed over in framework-vs-product comparisons. WhisperKit returns a string (and timestamps). Everything that turns that string into a usable product, you write yourself.

Layer | WhisperKit gives you | You build
Inference | Whisper model running on ANE / GPU | Model selection UX, download manager, version checks
Audio input | Path-to-string transcription | Mic capture, file picker, drag-drop, URL fetch (YouTube, podcasts, RSS)
Speakers | SpeakerKit diarization (separate kit) | Stitching speaker labels back to transcript timestamps
Long files | Streaming chunks | Progress UI, chunk merging, error recovery, background tasks
Exports | Plain text + timestamps | SRT, VTT, DOCX, JSON, captions burn-in, share sheets
Storage / sync | Nothing — it's a library | iCloud sync, retention rules, search, tagging, trash
App polish | Nothing | Onboarding, empty states, errors, accessibility, App Store listing
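
To make one row of that table concrete, here's roughly what the "you build" side of the Exports row costs. This is a minimal sketch assuming a hypothetical Segment type, since the exact shape of WhisperKit's timestamped result depends on the SDK version:

import Foundation

// Hypothetical Segment type; WhisperKit's real results carry similar
// per-segment start/end timestamps in seconds.
struct Segment { let start: Double; let end: Double; let text: String }

func srt(from segments: [Segment]) -> String {
    // SRT timestamps look like 00:01:02,345
    func stamp(_ seconds: Double) -> String {
        let ms = Int(seconds * 1000)
        return String(format: "%02d:%02d:%02d,%03d",
                      ms / 3_600_000, ms / 60_000 % 60, ms / 1000 % 60, ms % 1000)
    }
    return segments.enumerated().map { i, s in
        "\(i + 1)\n\(stamp(s.start)) --> \(stamp(s.end))\n\(s.text)\n"
    }.joined(separator: "\n")
}

And that's one cell; multiply it by every row in the "you build" column.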

Apps like MacWhisper, Superwhisper, VoiceInk, and Voibe all bundle Whisper (via WhisperKit or whisper.cpp) and then write the rest. That "rest" is the actual product.

Whipscribe is a different shape

Whipscribe is the assembled product. You don't write Swift, you don't bundle weights, you don't manage downloads. You hand over audio (a file or a URL) and get back a transcript with speaker labels and exports — through a browser, the Chrome extension, the REST API, or an MCP server an AI agent calls on your behalf.

Dimension | WhisperKit (argmax-oss-swift) | Whipscribe
Shape | Swift package · developer SDK | Hosted product · browser + API + MCP
Audience | Apple-platform developers | Anyone with audio, plus AI agents
Where it runs | On the user's iPhone / iPad / Mac | Server GPUs (cloud)
Platforms | iOS · iPadOS · macOS · watchOS · visionOS | Any browser · iOS · Android · CLI · MCP
Models | Whisper Tiny → Large-v3 Turbo (you pick) | Whisper Large-v3 + diarization (managed)
Offline | Yes — full offline | No — internet required
On-device privacy | Audio never leaves the device | Audio uploaded to server
License / cost | MIT · free + your dev time + Apple device | $0 free tier · $2/hr PAYG · $12 / $29 monthly
Diarization | SpeakerKit (separate, you wire it up) | Built-in on every paid tier
YouTube / RSS / URL | You build the fetcher | Paste a URL — done
Exports (SRT/DOCX/JSON) | You build them | Built-in
Time to first transcript | Hours-to-days (integration) | Seconds (paste a URL)
AI-agent access (MCP) | No | Yes — public MCP server

Worked example 1 — indie iOS dev shipping a voice memo app

You're building "Folio Notes," an iPhone-first voice-memo app that auto-transcribes recordings, lets the user search across them, and never sends audio off the device. Your target user is a knowledge worker who wants Apple-native UX and privacy.

Right tool: WhisperKit. Fetch Large-v3 Turbo (~1.6 GB, downloaded from Hugging Face on first launch), wire WhisperKit().transcribe to your AVFoundation mic capture, and store transcripts in Core Data. The on-device, offline, private story is the entire pitch — outsourcing transcription to a cloud API would undermine it. Whipscribe is the wrong choice here.
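
A minimal sketch of that wiring, using the transcribe(audioPath:) entry point from earlier. FolioRecorder is a hypothetical name, and a real app would also configure AVAudioSession and request microphone permission first:

import AVFoundation
import WhisperKit

// Hypothetical sketch: record a memo to a temp file, then transcribe on-device.
final class FolioRecorder {
    private var recorder: AVAudioRecorder?
    private let fileURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("memo.m4a")

    func start() throws {
        let settings: [String: Any] = [
            AVFormatIDKey: kAudioFormatMPEG4AAC,
            AVSampleRateKey: 16_000,      // Whisper models expect 16 kHz audio
            AVNumberOfChannelsKey: 1,     // mono is enough for speech
        ]
        recorder = try AVAudioRecorder(url: fileURL, settings: settings)
        _ = recorder?.record()
    }

    func stopAndTranscribe() async throws -> String {
        recorder?.stop()
        // In a real app, keep the pipeline alive across recordings
        // instead of reloading the model every time.
        let pipe = try await WhisperKit()
        let result = try await pipe.transcribe(audioPath: fileURL.path)
        return result?.text ?? ""
    }
}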

Worked example 2 — same dev, podcast repurposing pipeline

The Folio Notes team now wants to ship a side feature: subscribe to a podcast RSS feed, auto-transcribe each new episode, and turn the transcript into blog-post drafts via Claude. The episodes are hours long, the job runs server-side at night, and the output goes through a Claude agent that picks quotes and writes the draft.

Right tool: Whipscribe. Pinning an iPhone for two hours per episode to run Whisper locally is bad UX, terrible battery economics, and impossible if the user closes the app. The Whipscribe MCP server lets the Claude agent call transcribe_url directly, batch through the backlog on server GPUs, and hand quote-ready transcripts back to the model. WhisperKit is the wrong choice for this leg of the same product.

Same team, same product — two transcription tools, picked by job. That's the realistic answer for most apps in 2026.

Worked example 3 — non-developer with audio

You host a weekly podcast, you have last week's interview as an .m4a, and you want a clean transcript with speaker labels so you can pull quote graphics. You don't write Swift. You don't want to write Swift.

Right tool: Whipscribe. WhisperKit isn't a product — it's a library. There is no "open WhisperKit and drop a file in." For everyone who isn't an Apple-platform developer, the entire framework category is the wrong layer of the stack. You want an assembled product where someone has already written the UI, the file handling, the diarization wiring, and the export buttons.

Skip the integration · paste a URL
30 free minutes a day · no sign-up

If you have audio and a deadline, the framework path is the long way around. Drop a file or a YouTube link, get a transcript with speaker labels, exports included.

Try Whipscribe →

Honest tradeoffs — neither tool is a free lunch

Where WhisperKit wins outright

  • On-device privacy: audio never leaves the iPhone, iPad, or Mac.
  • Full offline operation: no network, no account, no per-hour metering.
  • Live-mic latency: sub-200 ms to first word on modern Apple Silicon.
  • License cost: MIT and $0, if you already have the Swift team to integrate it.

Where Whipscribe wins outright

  • Time to first transcript: seconds from a pasted file or URL, no integration.
  • Platform reach: any browser on any OS, plus the Chrome extension, REST API, and MCP server.
  • Batteries included: speaker diarization, URL fetching, and SRT / VTT / DOCX / JSON exports.
  • AI-agent access: a public MCP server your Claude or GPT workflow can call directly.

Where each one has real limits

  • WhisperKit: Apple platforms only, ~1.6 GB of model weights to manage, and every product layer above raw inference is yours to build.
  • Whipscribe: internet required, audio is uploaded to server GPUs, and sustained heavy usage is metered per hour.

The hybrid pattern: use both

The teams getting this right in 2026 use both tools, picked per use case:

  • WhisperKit on-device for live mic capture and for anything that must stay private or work offline.
  • Whipscribe for long files, overnight batch jobs, URL sources, and agent pipelines where the phone's battery and thermal budget are the bottleneck.

A Swift app calling Whipscribe for batch jobs is just an HTTP client — same shape as calling any other service. Add a flag in your settings ("transcribe long files in the cloud — faster, requires internet") and you've shipped the union.
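
Here's roughly what that client looks like. This is a hypothetical sketch: the base URL, endpoint path, request fields, and response shape are placeholders, not Whipscribe's documented API, so check the real API reference before shipping.

import Foundation

// Hypothetical sketch: swap the URL, field names, and response shape
// for the ones in the real Whipscribe API docs.
struct TranscriptResponse: Decodable {
    let text: String
}

func transcribeInCloud(mediaURL: URL, apiKey: String) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.whipscribe.example/v1/transcriptions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(["url": mediaURL.absoluteString])

    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(TranscriptResponse.self, from: data).text
}

Gate it behind that settings flag and fall back to on-device WhisperKit when the user is offline.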

Pricing — the honest comparison

Path | What you pay | What it covers
WhisperKit | $0 license · your dev time · users' Apple device | Framework + the Whisper weights you bundle. Engineering, UX, exports, distribution all on you.
Whipscribe Free | $0 | 30 min / day, every day. No sign-up, no card.
Whipscribe PAYG | $2 / hour of audio | Per-hour billing. Diarization included. Spiky usage friendly.
Whipscribe Pro | $12 / month | 100 hours / month. One-person backlog cleanup.
Whipscribe Team | $29 / month | 500 hours / month. Podcast networks, research teams.

WhisperKit's $0 framework cost is real, but it's not the comparable number — engineering time and device hardware are the real line items. For a solo dev's voice-memo app, the math works out. For "I have audio, I want a transcript by Friday," it does not.

Frequently asked

What is WhisperKit?

WhisperKit is an open-source Swift package from Argmax that runs OpenAI's Whisper speech-recognition models on Apple Silicon, accelerated via CoreML and the Apple Neural Engine. As of v1.0.0 in 2026 the project was renamed to argmaxinc/argmax-oss-swift and ships SpeakerKit (diarization) and TTSKit (text-to-speech) under the same SDK. It's a framework for developers, not a consumer product.

Is WhisperKit free?

The framework itself is free under the MIT license, and the Whisper weights it loads are openly licensed. The cost is your developer time to integrate it, the Apple device you ship on, and every product layer above raw inference — UI, exports, sharing, retention.

Should I choose WhisperKit or Whipscribe?

WhisperKit if you're a Swift developer building a feature into your iOS or Mac app and you want on-device, offline, private inference. Whipscribe if you're anyone else — a podcaster, journalist, researcher, founder, student, or an AI agent calling our MCP server. Different jobs, different audiences.

What models does WhisperKit support?

The full Whisper family compiled to CoreML: Tiny, Base, Small, Medium, Large-v2, Large-v3, Large-v3 Turbo, and distilled variants. The typical production recipe on iPhone 13 and newer is Large-v3 Turbo — near-Large accuracy at roughly 5× throughput, ~1.6 GB on disk, real-time streaming on the Apple Neural Engine.
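
If you want to pin the model rather than take the default, pre-rename releases exposed a model-name parameter on the initializer. A sketch, assuming that signature carries over to argmax-oss-swift; check the current docs and the published model list for the exact name string:

import WhisperKit

// Assumption: WhisperKit(model:) as in pre-1.0 releases; the exact
// model-name string may differ in the current model registry.
let pipe = try await WhisperKit(model: "large-v3_turbo")
let result = try await pipe.transcribe(audioPath: url.path)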

How fast is WhisperKit on an iPhone?

Argmax's published benchmarks show sub-200 ms first-word latency on modern iPhones with Large-v3 Turbo on the Apple Neural Engine, and 15–30× real-time on Apple Silicon Macs for batch transcription. Real numbers depend on device generation, model size, thermal headroom, and audio length.

Does Whipscribe run on-device?

No. Whipscribe is a hosted service — audio is uploaded to our server GPUs, transcribed with Whisper Large-v3 plus diarization, and the result is returned to your browser, our Chrome extension, or an MCP client. If on-device privacy is a hard requirement, WhisperKit is the right tool.

Can I use WhisperKit on Windows or Android?

No — WhisperKit is Apple-platform only. It depends on Swift, CoreML, and the Apple Neural Engine. For cross-platform on-device Whisper, look at whisper.cpp. For cross-platform hosted transcription, Whipscribe works in any browser on any OS.

Can I call Whipscribe from a Swift app?

Yes. Whipscribe exposes a REST API and an MCP server, so a Swift app can post audio (or a URL) and read back JSON, SRT, VTT, DOCX, or TXT — same as any HTTP client. Some teams ship hybrid: WhisperKit on-device for live mic input, Whipscribe for long-batch jobs where the iPhone's battery and thermal budget are the bottleneck.

If you're shipping a Swift app and you want on-device transcription, WhisperKit is the right tool. If you have audio and you want a transcript today, that's us.

Try Whipscribe →