# Whisper API vs Whipscribe: what you actually pay and get in 2026
OpenAI's Whisper API is $0.006 per minute. Whipscribe is $1 per hour flat. Underneath, both run the same model family. The real question is what each price covers — when calling the raw API saves you money, and when it quietly costs more.
## The headline pricing
Per OpenAI's public pricing page (checked 2026-04-24), the Whisper v2 audio-transcription model is $0.006 per minute — that's $0.36 per hour of audio. OpenAI has since introduced GPT-4o-mini-transcribe at $0.003 per minute for cost-sensitive use.
Whipscribe is $1 per hour of audio on pay-as-you-go, or $8/month Pro with a 100-hour monthly cap. Credits never expire.
| Cost line | OpenAI Whisper API | Whipscribe |
|---|---|---|
| Per hour of audio | $0.36 | $1.00 PAYG · effectively $0.08/hr at 100-hr Pro cap |
| Per minute | $0.006 | $0.0167 PAYG |
| Free tier | No free tier on the API itself | 30 min/day every day, no sign-up |
| Volume commitment | Pay-as-you-go per API call | No minimums on PAYG; cancel Pro any time |
On raw inference, the API wins on cost. That's the whole honest answer if all you care about is the number.
## What the $0.36/hr gets you (and doesn't)
The Whisper API accepts a single audio file up to 25 MB and returns JSON. That's the full contract. Everything else is your problem:
- File > 25 MB — you chunk it yourself, track offsets, re-align timestamps.
- URL inputs — the API takes a file blob, not a URL. For YouTube, Zoom, or a direct-download link, you handle the download, the format conversion, and the error cases (bot checks, rate limits, missing audio tracks).
- Speaker diarization — Whisper transcribes but does not identify speakers. You add `pyannote-audio` or `whisperX`, obtain a HuggingFace token with the gated-model acceptance, run diarization as a second pass, and align the outputs.
- Word-level timestamps — available via the `timestamp_granularities` parameter, but you format the SRT/VTT yourself.
- Exports — DOCX with paragraph breaks, human-readable TXT with speaker turns, JSON shapes that downstream tools expect. All downstream work.
- Retention and sharing — shareable links, search across old transcripts, team access. Build it.
- A UI for someone who isn't you — every non-technical colleague. Build it.
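The first items on that list are real code, not configuration. Here is a minimal sketch of the re-alignment and SRT-formatting glue, assuming each chunk's segments follow Whisper's `verbose_json` shape (`start`/`end` in seconds plus `text`) and that your splitter recorded each chunk's offset into the original file:

```python
def realign(chunks):
    """chunks: list of (offset_seconds, segments) pairs, one per audio chunk.
    Shifts each chunk's local timestamps back onto the full file's timeline."""
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged

def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm layout SRT requires."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """Number the merged segments and emit standard SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        start, end = srt_time(seg["start"]), srt_time(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

The chunking itself (ffmpeg or pydub), the diarization pass, and the error handling sit on top of this.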
## What the $1/hr covers
Whipscribe runs production faster-whisper plus whisperX behind the web UI, REST API, and MCP server. The same $1/hr price covers the full pipeline:
- Paste a YouTube, Vimeo, or direct-download URL — we pull the audio.
- Upload files up to multi-hour length; we chunk and re-align internally.
- Speaker diarization runs by default on every upload.
- Word-level timestamps in SRT and VTT out of the box.
- Exports: TXT, SRT, VTT, DOCX, JSON.
- Shareable view links, retention, batch folder uploads.
- MCP server (`whipscribe_mcp` on PyPI) so Claude Desktop or Cursor can call it directly.
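For orientation, a Claude Desktop config entry for that MCP server might look like the sketch below. Only the package name `whipscribe_mcp` comes from this article; the launch command (`uvx`) and the server key are assumptions, so check the package's own docs for the real invocation:

```json
{
  "mcpServers": {
    "whipscribe": {
      "command": "uvx",
      "args": ["whipscribe_mcp"]
    }
  }
}
```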
## When the API is the right answer
Use the raw Whisper API when all three of these are true:
- The transcript is an internal step in a larger product — feeding an LLM summary, populating a database field, powering a search index. Not something a human reads.
- Speaker attribution doesn't matter. Single-speaker audio, monologue-only content, voice notes.
- You already control the upload path and the audio is under 25 MB per file.
In that world, $0.36/hr is the right number and you don't need anything sitting on top of it.
## When Whipscribe is the right answer
Whipscribe is the right answer when any of the following are true:
- A human will read or edit the transcript.
- The source is a URL (YouTube, Zoom, podcast RSS) rather than a file you control.
- You need speaker labels, word-level SRT, or DOCX exports.
- You or your team transcribe audio periodically, not as a product backend — in which case the sticker price matters less than time-to-transcript.
- You're calling it from Claude Desktop or Cursor via MCP, and you don't want to run your own server.
Paste a URL or upload a file — see the output before you decide on either path.
Open Whipscribe →

## A back-of-envelope example
You're a solo developer building a podcast-summary tool. 200 episodes per month, average 45 minutes per episode = 150 hours.
- Whisper API direct: 150 × $0.36 = $54 per month in inference, plus the build time for chunking, diarization, URL ingest, and exports. Realistically 40-60 engineering hours to get to feature parity with a hosted tool, then ongoing maintenance.
- Whipscribe Pro: $8 per month up to 100 hours, then $1/hr beyond. 150 hours → $8 + 50 × $1 = $58. Exports, diarization, URL ingestion already done.
At this volume the costs basically match. The difference is the 40-60 hours you didn't spend building the pipeline. That's the entire value proposition of a hosted tool at this price point — and it's why "the API is $0.36/hr" is technically true but almost never the right framing once you price your own time in.
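The arithmetic above, plus the break-even point, can be checked in a few lines. This sketch works in integer cents to avoid float drift; the prices are the ones quoted in this article:

```python
def api_cents(hours):
    """Raw Whisper API: $0.006/min == 36 cents per hour of audio."""
    return hours * 36

def pro_cents(hours):
    """Whipscribe Pro: $8 base covers the first 100 hours, $1/hr after."""
    return 800 + max(0, hours - 100) * 100

# The article's example: 150 hours per month.
print(api_cents(150) / 100)  # 54.0
print(pro_cents(150) / 100)  # 58.0
```

On sticker price alone, Pro is the cheaper line only between roughly 23 and 143 hours a month: below that, the API's metered rate undercuts the $8 base, and above about 144 hours the metered rate wins again. That narrow band is why the build time, not the inference price, decides this comparison.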
## The underlying model is the same family
This isn't a quality-vs-price tradeoff; it's a build-vs-buy one. The OpenAI Whisper API uses Whisper v2. Whipscribe runs faster-whisper, a rewrite of the same model family that's up to 4x faster at equal accuracy per the faster-whisper repository on GitHub (checked 2026-04-24). Add whisperX for forced alignment and word-level timestamps. In practice, transcript quality on a typical podcast interview is close enough that the average user can't tell — the differences are model size choice, VAD handling, and audio preprocessing. None of them are price-tier differentiators.
## Frequently asked

### What does the Whisper API actually cost?
$0.006 per minute per OpenAI's public pricing page (checked 2026-04-24). That's $0.36 per hour. OpenAI's newer GPT-4o-mini-transcribe is $0.003 per minute for cost-sensitive workloads.
### How does $1 per hour compare to $0.006 per minute?

On raw inference, the API is cheaper ($0.36 vs $1). The difference is what each price covers. The API hands you JSON back from one file. Whipscribe at $1/hr covers URL ingestion, diarization, word-level timestamps, multi-format exports, retention, UI, and MCP access.
### Does the Whisper API do speaker diarization?
No. Whisper transcribes but does not identify speakers. For diarization you add pyannote-audio or whisperX, manage a HuggingFace token, align the outputs, and handle the extra compute. This is the biggest hidden cost.
### When should I just use the Whisper API directly?
When a transcript is one internal step in a larger product, you don't need speaker labels, and you control the upload path. API + your own pipeline is the right call.
### Can I get the Whipscribe workflow without the web UI?
Yes. Whipscribe ships an MCP server so you can call it from Claude Desktop, Cursor, or any MCP client. Same diarization, same exports, no browser.
Transcript with speaker labels and word-level SRT, $1 per hour of audio — no chunking, no diarization setup, no token management.
Try Whipscribe →