# Whisper API vs Whipscribe: what you actually pay and get in 2026
OpenAI's Whisper API is $0.006 per minute. Whipscribe is $1 per hour flat. Underneath, both run the same model family. The real question is what each price covers — when calling the raw API saves you money, and when it quietly costs more.
## The headline pricing
Per OpenAI's public pricing page (checked 2026-04-24), the Whisper v2 audio-transcription model is $0.006 per minute — that's $0.36 per hour of audio. OpenAI has since introduced GPT-4o-mini-transcribe at $0.003 per minute for cost-sensitive use.
Whipscribe is $1 per hour of audio on pay-as-you-go, or $8/month Pro with a 100-hour monthly cap. Credits never expire.
| Cost line | OpenAI Whisper API | Whipscribe |
|---|---|---|
| Per hour of audio | $0.36 | $1.00 PAYG · effectively $0.08/hr at 100-hr Pro cap |
| Per minute | $0.006 | $0.0167 PAYG |
| Free tier | No free tier on the API itself | 30 min/day every day, no sign-up |
| Volume commitment | Pay-as-you-go per API call | No minimums on PAYG; cancel Pro any time |
On raw inference, the API wins on cost. That's the whole honest answer if all you care about is the number.
## What the $0.36/hr gets you (and doesn't)
The Whisper API accepts a single audio file up to 25 MB and returns JSON. That's the full contract. Everything else is your problem:
- File > 25 MB — you chunk it yourself, track offsets, re-align timestamps.
- URL inputs — the API takes a file blob, not a URL. For YouTube, Zoom, or a direct-download link, you handle the download, the format conversion, and the error cases (bot checks, rate limits, missing audio tracks).
- Speaker diarization — Whisper transcribes but does not identify speakers. You add `pyannote-audio` or `whisperX`, obtain a HuggingFace token with the gated-model acceptance, run diarization as a second pass, and align the outputs.
- Word-level timestamps — available via the `timestamp_granularities` parameter, but you format the SRT/VTT yourself.
- Exports — DOCX with paragraph breaks, human-readable TXT with speaker turns, JSON shapes that downstream tools expect. All downstream work.
- Retention and sharing — shareable links, search across old transcripts, team access. Build it.
- A UI for someone who isn't you — every non-technical colleague. Build it.
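The first items on that list are real code, not configuration. Here is a minimal sketch of the re-alignment and SRT-formatting glue, assuming each chunk's segments follow Whisper's `verbose_json` shape (`start`/`end` in seconds plus `text`) and that your splitter recorded each chunk's offset into the original file:

```python
def realign(chunks):
    """chunks: list of (offset_seconds, segments) pairs, one per audio chunk.
    Shifts each chunk's local timestamps back onto the full file's timeline."""
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged

def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm layout SRT requires."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """Number the merged segments and emit standard SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        start, end = srt_time(seg["start"]), srt_time(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

The chunking itself (ffmpeg or pydub), the diarization pass, and the error handling sit on top of this.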
## What the $1/hr covers
Whipscribe runs production faster-whisper plus whisperX behind the web UI, REST API, and MCP server. The same $1/hr price covers the full pipeline:
- Paste a YouTube, Vimeo, or direct-download URL — we pull the audio.
- Upload files up to multi-hour length; we chunk and re-align internally.
- Speaker diarization runs by default on every upload.
- Word-level timestamps in SRT and VTT out of the box.
- Exports: TXT, SRT, VTT, DOCX, JSON.
- Shareable view links, retention, batch folder uploads.
- MCP server (`whipscribe_mcp` on PyPI) so Claude Desktop or Cursor can call it directly.
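For orientation, a Claude Desktop config entry for that MCP server might look like the sketch below. Only the package name `whipscribe_mcp` comes from this article; the launch command (`uvx`) and the server key are assumptions, so check the package's own docs for the real invocation:

```json
{
  "mcpServers": {
    "whipscribe": {
      "command": "uvx",
      "args": ["whipscribe_mcp"]
    }
  }
}
```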
## When the API is the right answer
Use the raw Whisper API when all three of these are true:
- The transcript is an internal step in a larger product — feeding an LLM summary, populating a database field, powering a search index. Not something a human reads.
- Speaker attribution doesn't matter. Single-speaker audio, monologue-only content, voice notes.
- You already control the upload path and the audio is under 25 MB per file.
In that world, $0.36/hr is the right number and you don't need anything sitting on top of it.
## When Whipscribe is the right answer
Whipscribe is the right answer when any of the following are true:
- A human will read or edit the transcript.
- The source is a URL (YouTube, Zoom, podcast RSS) rather than a file you control.
- You need speaker labels, word-level SRT, or DOCX exports.
- You or your team transcribe audio periodically, not as a product backend — in which case the sticker price matters less than time-to-transcript.
- You're calling it from Claude Desktop or Cursor via MCP, and you don't want to run your own server.
Paste a URL or upload a file — see the output before you decide on either path.
Open Whipscribe →

## A back-of-envelope example
You're a solo developer building a podcast-summary tool. 200 episodes per month, average 45 minutes per episode = 150 hours.
- Whisper API direct: 150 × $0.36 = $54 per month in inference, plus the build time for chunking, diarization, URL ingest, and exports. Realistically 40-60 engineering hours to get to feature parity with a hosted tool, then ongoing maintenance.
- Whipscribe Pro: $8 per month up to 100 hours, then $1/hr beyond. 150 hours → $8 + 50 × $1 = $58. Exports, diarization, URL ingestion already done.
At this volume the costs basically match. The difference is the 40-60 hours you didn't spend building the pipeline. That's the entire value proposition of a hosted tool at this price point — and it's why "the API is $0.36/hr" is technically true but almost never the right framing once you price your own time in.
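The arithmetic above, plus the break-even point, can be checked in a few lines. This sketch works in integer cents to avoid float drift; the prices are the ones quoted in this article:

```python
def api_cents(hours):
    """Raw Whisper API: $0.006/min == 36 cents per hour of audio."""
    return hours * 36

def pro_cents(hours):
    """Whipscribe Pro: $8 base covers the first 100 hours, $1/hr after."""
    return 800 + max(0, hours - 100) * 100

# The article's example: 150 hours per month.
print(api_cents(150) / 100)  # 54.0
print(pro_cents(150) / 100)  # 58.0
```

On sticker price alone, Pro is the cheaper line only between roughly 23 and 143 hours a month: below that, the API's metered rate undercuts the $8 base, and above about 144 hours the metered rate wins again. That narrow band is why the build time, not the inference price, decides this comparison.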
## The underlying model is the same family
This isn't a quality-vs-price tradeoff; it's a build-vs-buy one. The OpenAI Whisper API uses Whisper v2. Whipscribe runs faster-whisper, a rewrite of the same model family that's up to 4x faster at equal accuracy per the faster-whisper repository on GitHub (checked 2026-04-24). Add whisperX for forced alignment and word-level timestamps. In practice, transcript quality on a typical podcast interview is close enough that the average user can't tell — the differences are model size choice, VAD handling, and audio preprocessing. None of them are price-tier differentiators.
## Frequently asked

### What does the Whisper API actually cost?
$0.006 per minute per OpenAI's public pricing page (checked 2026-04-24). That's $0.36 per hour. OpenAI's newer GPT-4o-mini-transcribe is $0.003 per minute for cost-sensitive workloads.
### How does $1 per hour compare to $0.006 per minute?

On raw inference, the API is cheaper ($0.36 vs $1). The difference is what each price covers. The API hands you JSON back from one file. Whipscribe at $1/hr covers URL ingestion, diarization, word-level timestamps, multi-format exports, retention, UI, and MCP access.
### Does the Whisper API do speaker diarization?
No. Whisper transcribes but does not identify speakers. For diarization you add pyannote-audio or whisperX, manage a HuggingFace token, align the outputs, and handle the extra compute. This is the biggest hidden cost.
### When should I just use the Whisper API directly?
When a transcript is one internal step in a larger product, you don't need speaker labels, and you control the upload path. API + your own pipeline is the right call.
### Can I get the Whipscribe workflow without the web UI?
Yes. Whipscribe ships an MCP server so you can call it from Claude Desktop, Cursor, or any MCP client. Same diarization, same exports, no browser.
Transcript with speaker labels and word-level SRT, $1 per hour of audio — no chunking, no diarization setup, no token management.
Try Whipscribe →