OpenAI Whisper API
OpenAI's hosted Whisper API — gpt-4o-transcribe is the newest model, gpt-4o-mini-transcribe is the cheaper sibling, whisper-1 is the legacy reference. One REST surface.
OpenAI exposes hosted speech-to-text on three models behind a single REST endpoint, POST https://api.openai.com/v1/audio/transcriptions: gpt-4o-transcribe and gpt-4o-mini-transcribe (newer GPT-4o-class transcribers, faster and tuned for noisier audio) and whisper-1 (the legacy reference model running Whisper large-v2). The same surface also serves audio-to-English at /v1/audio/translations, and the parallel Realtime API (gpt-4o-realtime-preview) carries streaming STT over WebSocket / WebRTC for live agents.
Best for teams already paying for OpenAI infra, anyone mixing transcription with chat / function-calling in the same SDK, or workflows where vendor consolidation beats per-minute price. Current per-minute rates: gpt-4o-transcribe at $0.006/min, gpt-4o-mini-transcribe at $0.003/min, whisper-1 at $0.006/min, billed to the nearest second. Hard limits: 25 MB per request, formats mp3 / mp4 / mpeg / mpga / m4a / wav / webm, no built-in diarization, no batch discount.
What it is
OpenAI's hosted Whisper API is a low-setup way to get Whisper-grade transcription without running infrastructure. Pricing is pay-as-you-go: $0.006 per minute for whisper-1 and gpt-4o-transcribe, $0.003 per minute for gpt-4o-mini-transcribe. The transcription endpoint has no diarization and no streaming; for diarization, self-host whisperX, and for streaming, use the Realtime API. Last price check: 2026-04-20.
Watch out for: 25 MB file size limit; no diarization; no batch discount; latency dominated by upload for large files.
Install / use
Where the Whisper API fits · 6 use-cases
OpenAI's audio surface is small but covers the common shapes: subtitles, translation, long-form chunking, real-time, telephony, and multilingual. Each card points at the matching section on platform.openai.com — pick the closest one and copy a recipe below.
Pass response_format=srt or response_format=vtt to /v1/audio/transcriptions and the API returns a subtitle file directly — no client-side stitching. For finer timing pass response_format=verbose_json with timestamp_granularities=['segment','word'] to get per-segment and per-word offsets on whisper-1 or gpt-4o-transcribe.
verbose_json + timestamp_granularities for word-level offsets
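If you request verbose_json for the timing data and still want a subtitle file from the same response, the segments can be rendered locally instead of making a second call. A minimal sketch, assuming each segment has already been converted to a plain dict with start and end (in seconds) and text keys; srt_timestamp and segments_to_srt are hypothetical helper names, not part of the OpenAI SDK.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Render segments (dicts with start, end, text) as SRT cue blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each cue: running number, time range, text, then a blank separator line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": " Welcome back."},
    ]
    print(segments_to_srt(demo))
```

When you only need the subtitle file, asking the API for srt or vtt directly remains the simpler route.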
The sibling endpoint /v1/audio/translations transcribes non-English audio directly into English text in one call — useful when downstream tooling is English-only. Currently supported on whisper-1; for other source/target pairs run /transcriptions then a chat model.
Translation task is English-only output
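The translation call mirrors the transcription one. A minimal sketch, assuming the official openai Python SDK and its audio.translations.create method (listed under Links on this page); translate_to_english and the file name in the usage comment are hypothetical. The client is passed in as an argument so the helper can be exercised with a stub before spending API credit.

```python
from pathlib import Path


def translate_to_english(client, audio_path: str) -> str:
    """Send non-English audio to /v1/audio/translations and return English text.

    `client` is expected to be an openai.OpenAI instance (or a stub with the
    same audio.translations.create shape).
    """
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.translations.create(
            model="whisper-1",  # the model this page lists for translations
            file=audio_file,
        )
    return result.text


# Usage (needs OPENAI_API_KEY in the environment and a real audio file):
#   from openai import OpenAI
#   print(translate_to_english(OpenAI(), "interview_ja.mp3"))
```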
Files over 25 MB must be split client-side before upload. The OpenAI cookbook pattern uses pydub to slice on silence boundaries, transcribe each chunk in parallel, then concatenate. Alternative: compress to 16 kHz mono opus / m4a to fit a 60-90 min episode under 25 MB.
Hard 25 MB cap per request — no server-side chunking
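Before reaching for pydub, it helps to know how many pieces a file needs. A rough sketch of that arithmetic, assuming constant-bitrate audio; plan_chunks and its 24,000,000-byte default (headroom under the 25 MB cap) are illustrative choices, not OpenAI values. It returns equal-length time windows and does not look for silence, so a real pipeline would move each boundary to the nearest pause.

```python
import math
from typing import List, Tuple


def plan_chunks(
    duration_s: float,
    bitrate_kbps: float,
    max_bytes: int = 24_000_000,  # assumed safety margin below the 25 MB cap
) -> List[Tuple[float, float]]:
    """Split a recording into equal (start, end) windows that each fit max_bytes."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = max_bytes / bytes_per_second
    # At least one chunk, even for very short or empty recordings.
    count = max(1, math.ceil(duration_s / max_chunk_s))
    chunk_s = duration_s / count
    return [
        (round(i * chunk_s, 3), round(min(duration_s, (i + 1) * chunk_s), 3))
        for i in range(count)
    ]


if __name__ == "__main__":
    # One-hour episode at 128 kbps versus the same episode re-encoded at 32 kbps.
    print(plan_chunks(3600, 128))
    print(plan_chunks(3600, 32))
```

The second call illustrates the compression alternative: at a low enough bitrate the whole episode fits in a single request.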
For sub-second streaming STT use the Realtime API (gpt-4o-realtime-preview), not /audio/transcriptions — the REST endpoint is request/response only. Realtime carries bidirectional audio over WebSocket or WebRTC and is the path for voice agents, live captions, and IVR replacements.
Separate SKU and endpoint from /audio/transcriptions
gpt-4o-mini-transcribe at $0.003/min is the budget pick for high-volume call recordings and voicemail batches where every cent matters and word-level timestamps are optional. No diarization in the response — pair with a downstream speaker-attribution model or a tool that bundles it.
Half the per-minute cost of gpt-4o-transcribe and whisper-1
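For batch budgeting, the per-minute rates translate into a short calculation. A minimal sketch using the rates quoted on this page; estimate_cost is a hypothetical helper, and rounding each file to the nearest whole second is an assumption about how per-second billing applies.

```python
# Per-minute USD rates as quoted on this page; confirm against
# openai.com/api/pricing before relying on them.
RATES_PER_MIN = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
}


def estimate_cost(model: str, duration_s: float, files: int = 1) -> float:
    """Estimated USD cost for `files` recordings of `duration_s` seconds each."""
    billed_seconds = round(duration_s)  # assumed: each file billed to the nearest second
    return round(files * billed_seconds * RATES_PER_MIN[model] / 60, 6)


if __name__ == "__main__":
    # Example batch: 10,000 call recordings of three minutes each.
    for name in RATES_PER_MIN:
        print(f"{name}: ${estimate_cost(name, 180, files=10_000):.2f}")
```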
All three models auto-detect the input language; pass an ISO-639-1 language hint (e.g. language='ja') to skip detection and improve accuracy on short clips. Quality varies by language — large-v2 weights underneath whisper-1 are the same set the open-source Whisper community benchmarks against.
Optional language hint sharpens short-clip detection
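A malformed hint is easy to catch before the request goes out. A small sketch, assuming you build the request arguments in one place; transcription_kwargs is a hypothetical helper, and it only checks that the hint has the two-letter ISO-639-1 shape, not that the code names a real language.

```python
import re
from typing import Dict, Optional


def transcription_kwargs(model: str, language: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments for a transcription call, validating the language hint."""
    kwargs = {"model": model}
    if language is not None:
        code = language.strip().lower()
        # ISO-639-1 codes are exactly two letters, e.g. "ja", "en", "de".
        if not re.fullmatch(r"[a-z]{2}", code):
            raise ValueError(f"expected a two-letter ISO-639-1 code, got {language!r}")
        kwargs["language"] = code
    return kwargs


# Usage with the SDK client from the quickstart (requires an API key):
#   client.audio.transcriptions.create(file=f, **transcription_kwargs("gpt-4o-transcribe", "ja"))
```

Leaving language unset keeps auto-detection, which is the better default for mixed-language batches.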
Quickstart · pick a runtime
Three minimal calls to /v1/audio/transcriptions with gpt-4o-transcribe. Export your key as OPENAI_API_KEY first — get one from the OpenAI dashboard. Never hard-code the key in source.
Official openai Python SDK · transcribe a local file with gpt-4o-transcribe.
# pip install --upgrade openai
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        # response_format="verbose_json",
        # timestamp_granularities=["segment", "word"],  # whisper-1 + gpt-4o-transcribe
        # language="en",  # ISO-639-1 hint
    )
print(resp.text)
Official openai Node / TypeScript SDK · same call from Node 18+.
// npm install openai
import fs from "node:fs";
import OpenAI from "openai";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const resp = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: fs.createReadStream("audio.mp3"),
  // response_format: "verbose_json",
  // timestamp_granularities: ["segment", "word"],
  // language: "en",
});
console.log(resp.text);
Plain HTTPS POST to /v1/audio/transcriptions · works from shell, CI, and edge runtimes.
# Bearer auth via $OPENAI_API_KEY · multipart upload
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="gpt-4o-transcribe"
# Subtitle output: add -F response_format="srt"
# Word timestamps (whisper-1 / gpt-4o-transcribe):
# -F response_format="verbose_json" \
# -F "timestamp_granularities[]=word" -F "timestamp_granularities[]=segment"
Features
| Feature | OpenAI Whisper API |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- platform.openai.com/docs/guides/speech-to-text: canonical Speech-to-text guide (model list, response_format options, timestamp_granularities, chunking patterns, language hints).
- API reference · createTranscription: full parameter surface for /v1/audio/transcriptions (file, model, prompt, response_format, temperature, language, timestamp_granularities).
- openai.com/api/pricing: live per-minute rates for gpt-4o-transcribe ($0.006/min), gpt-4o-mini-transcribe ($0.003/min), and whisper-1 ($0.006/min).
- status.openai.com: live status for the API, including Audio endpoints; subscribe via email or RSS.
- openai/openai-python: official Python SDK, v2.x line; audio.transcriptions.create + audio.translations.create, sync + async clients.
- openai/openai-node: official JavaScript / TypeScript SDK for Node 18+, browser, and edge runtimes; same audio.transcriptions surface.
- openai/openai-cookbook: official examples repo; see the examples/ folder for audio-chunking patterns (pydub silence-split) and verbose_json post-processing.
- platform.openai.com/docs/guides/realtime: Realtime API guide covering gpt-4o-realtime-preview over WebSocket / WebRTC for live STT, voice agents, and barge-in conversation loops.
OpenAI Whisper API vs Whipscribe
| Feature | OpenAI Whisper API | Whipscribe |
|---|---|---|
| Category | Transcription APIs | Transcription APIs |
| Pricing | $0.006/min | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | API | Web, API, MCP |
Sources & dates for the comparison above
- diarization: “The transcription API returns text and segments; speaker labels are not produced.” — source (checked 2026-04-23)
- word timestamps: “Set timestamp_granularities[] to word to receive per-word timestamps.” — source (checked 2026-04-23)
- streaming: “The transcription endpoint accepts complete audio files; streaming input is not supported.” — source (checked 2026-04-23)
- pricing: “Whisper: $0.006 / minute (rounded to the nearest second)” — source (checked 2026-04-24)
Alternatives to OpenAI Whisper API
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.