OpenAI Whisper API

Name: OpenAI Whisper API
Price: 0.006 USD per minute (whisper-1 and gpt-4o-transcribe; gpt-4o-mini-transcribe is 0.003 USD per minute)
Author: OpenAI

OpenAI's hosted Whisper API — gpt-4o-transcribe is the newest model, gpt-4o-mini-transcribe is the cheaper sibling, whisper-1 is the legacy reference. One REST surface.

TL;DR

OpenAI exposes hosted speech-to-text on three models behind a single REST endpoint, POST https://api.openai.com/v1/audio/transcriptions: gpt-4o-transcribe and gpt-4o-mini-transcribe (newer GPT-4o-class transcribers, faster and tuned for noisier audio) and whisper-1 (the legacy reference model running Whisper large-v2). The same surface also serves audio-to-English at /v1/audio/translations, and the parallel Realtime API (gpt-4o-realtime-preview) carries streaming STT over WebSocket / WebRTC for live agents.

Best for teams already paying for OpenAI infra, anyone mixing transcription with chat / function-calling in the same SDK, or workflows where vendor consolidation beats per-minute price. Current per-minute rates: gpt-4o-transcribe at $0.006/min, gpt-4o-mini-transcribe at $0.003/min, whisper-1 at $0.006/min, billed to the nearest second. Hard limits: 25 MB per request, formats mp3 / mp4 / mpeg / mpga / m4a / wav / webm, no built-in diarization, no batch discount.
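To make the per-minute rates concrete, here is a minimal cost sketch. It assumes the rates quoted above are still current and that usage is rounded to the nearest whole second; the function name estimate_cost_usd is a hypothetical helper, not part of any OpenAI SDK.

```python
# Per-minute rates (USD) as quoted in this article; verify against current pricing.
RATES_PER_MIN = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
}


def estimate_cost_usd(duration_seconds: float, model: str) -> float:
    """Estimate the charge for one file, assuming billing rounds to whole seconds."""
    billed_seconds = round(duration_seconds)
    return billed_seconds * RATES_PER_MIN[model] / 60


# Compare the three models on a 45-minute recording.
for name in RATES_PER_MIN:
    print(f"{name}: ${estimate_cost_usd(45 * 60, name):.3f}")
```

On these assumptions a 45-minute file comes to about $0.27 on gpt-4o-transcribe or whisper-1 and about $0.135 on gpt-4o-mini-transcribe.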

What it is

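OpenAI's hosted Whisper API is the easiest way to get Whisper-grade transcription without running infrastructure. Pricing is pay-as-you-go: $0.006 per minute for whisper-1 and gpt-4o-transcribe, $0.003 per minute for gpt-4o-mini-transcribe. The /v1/audio/transcriptions endpoint offers no diarization and no live streaming. For streaming, use the Realtime API; for diarization, pair the transcript with a speaker-attribution tool or self-host whisperX. Last price check: 2026-04-20.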

Best for: Teams already on OpenAI's stack who want Whisper without operating a GPU.
Watch out for: 25 MB file size limit; no diarization; no batch discount; latency dominated by upload for large files.

Install / use

POST https://api.openai.com/v1/audio/transcriptions

View OpenAI Audio API docs ↗

Where the Whisper API fits · 6 use-cases

OpenAI's audio surface is small but covers the common shapes: subtitles, translation, long-form chunking, real-time, telephony, and multilingual. Each card points at the matching section on platform.openai.com — pick the closest one and copy a recipe below.

Subtitles · VTT / SRT

response_format=srt or vtt

Pass response_format=srt or response_format=vtt to /v1/audio/transcriptions and the API returns a subtitle file directly, with no client-side stitching. For finer timing, pass response_format=verbose_json with timestamp_granularities=['segment','word'] to get per-segment and per-word offsets. OpenAI's API reference lists srt, vtt, verbose_json, and timestamp_granularities as whisper-1 features and limits the gpt-4o transcribe models to json and text output, so use whisper-1 for subtitle work.

whisper-1
verbose_json + timestamp_granularities for word-level offsets
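When the built-in srt output cuts cues in the wrong places, word-level offsets can be regrouped client-side. The sketch below assumes each word entry from a verbose_json response is a dict with "word", "start", and "end" keys (times in seconds); the helper names and the sample data are illustrative only.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word entries into numbered SRT cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in group)
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues in SRT.
    return "\n".join(cues)


# Made-up sample in the assumed word-entry shape.
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.42},
    {"word": "world", "start": 0.5, "end": 1.0},
    {"word": "again", "start": 3661.5, "end": 3662.25},
]
print(words_to_srt(sample, max_words=2))
```

Grouping by a fixed word count is the simplest policy; splitting on punctuation or on long gaps between words usually reads better in practice.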

Auto-translation to English

/v1/audio/translations

The sibling endpoint /v1/audio/translations transcribes non-English audio directly into English text in one call — useful when downstream tooling is English-only. Currently supported on whisper-1; for other source/target pairs run /transcriptions then a chat model.

whisper-1
Translation task is English-only output
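A translation call looks almost identical to a transcription call. This is a minimal sketch using the official Python SDK's client.audio.translations.create; the wrapper function and the file name interview_ja.mp3 are hypothetical.

```python
def translate_to_english(client, path: str) -> str:
    """Send one audio file to /v1/audio/translations and return the English text."""
    with open(path, "rb") as audio_file:
        resp = client.audio.translations.create(model="whisper-1", file=audio_file)
    return resp.text


def main() -> None:
    # Imported here so the helper above can be used or tested without the SDK installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    print(translate_to_english(client, "interview_ja.mp3"))  # hypothetical file name
```

Call main() with OPENAI_API_KEY exported to run it against a real file. Passing the client in as an argument keeps the helper easy to exercise with a stub.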

Long-form audio · chunking

25 MB request cap

Files over 25 MB must be split client-side before upload. The OpenAI cookbook pattern uses pydub to slice on silence boundaries, transcribe each chunk in parallel, then concatenate. Alternative: compress to 16 kHz mono opus / m4a to fit a 60-90 min episode under 25 MB.

any model
Hard 25 MB cap per request — no server-side chunking
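The arithmetic behind both options can be sketched in a few lines. This is not the silence-aware pydub slicing described above; it only computes a target bitrate and fixed time windows. It assumes a constant bitrate, ignores container overhead, and treats "25 MB" as 25,000,000 bytes to stay on the safe side. The function names and the headroom parameter are hypothetical.

```python
import math

# Assumption: decimal megabytes, the conservative reading of the 25 MB cap.
CAP_BYTES = 25 * 1000 * 1000


def max_bitrate_kbps(duration_minutes: float, cap_bytes: int = CAP_BYTES) -> float:
    """Highest constant bitrate (kbit/s) that keeps a file of this length under the cap."""
    return cap_bytes * 8 / (duration_minutes * 60) / 1000


def plan_chunks(duration_s: float, bitrate_kbps: float,
                cap_bytes: int = CAP_BYTES, headroom: float = 0.9):
    """Split a recording into equal time windows that each fit under the cap.

    headroom leaves a margin for container overhead and bitrate variation.
    """
    max_chunk_s = cap_bytes * headroom * 8 / (bitrate_kbps * 1000)
    n = math.ceil(duration_s / max_chunk_s)
    step = duration_s / n
    return [(round(i * step, 3), round((i + 1) * step, 3)) for i in range(n)]


print(f"90 min fits at up to {max_bitrate_kbps(90):.1f} kbit/s")
print(plan_chunks(2 * 60 * 60, bitrate_kbps=128))  # 2-hour file encoded at 128 kbit/s
```

Under these assumptions a 90-minute episode needs roughly 37 kbit/s or less to fit in a single request, and a 2-hour file at 128 kbit/s splits into six 20-minute windows. Cutting at the nearest silence to each boundary avoids slicing words in half.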

Real-time · live agents

Realtime API · WebSocket

For sub-second streaming STT use the Realtime API (gpt-4o-realtime-preview), not /audio/transcriptions — the REST endpoint is request/response only. Realtime carries bidirectional audio over WebSocket or WebRTC and is the path for voice agents, live captions, and IVR replacements.

gpt-4o-realtime-preview
Separate SKU and endpoint from /audio/transcriptions

Calls · voicemail · meetings

gpt-4o-mini-transcribe · cheapest

gpt-4o-mini-transcribe at $0.003/min is the budget pick for high-volume call recordings and voicemail batches where every cent matters and word-level timestamps are optional. No diarization in the response — pair with a downstream speaker-attribution model or a tool that bundles it.

gpt-4o-mini-transcribe
Half the per-minute cost of gpt-4o-transcribe and whisper-1

Multilingual · 50+ languages

Auto language detect

All three models auto-detect the input language; pass an ISO-639-1 language hint (e.g. language='ja') to skip detection and improve accuracy on short clips. Quality varies by language. The large-v2 weights behind whisper-1 are the same ones the open-source Whisper community benchmarks against.

all 3 models
Optional language hint sharpens short-clip detection
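One way to keep malformed language hints out of requests is to validate them before the call. This is a minimal sketch; transcription_kwargs is a hypothetical helper, and it only checks the two-letter shape of the code, not whether the language is actually supported.

```python
from typing import Optional


def transcription_kwargs(model: str, language: Optional[str] = None) -> dict:
    """Build keyword arguments for client.audio.transcriptions.create()."""
    kwargs = {"model": model}
    if language is not None:
        code = language.strip().lower()
        # ISO-639-1 codes are exactly two ASCII letters, e.g. "ja", "en", "de".
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"expected a two-letter ISO-639-1 code, got {language!r}")
        kwargs["language"] = code
    return kwargs


print(transcription_kwargs("gpt-4o-transcribe", "JA"))
```

The result can be splatted into the SDK call, for example client.audio.transcriptions.create(file=f, **transcription_kwargs("gpt-4o-transcribe", "ja")). Omitting the hint leaves auto-detection in place.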

Pattern: the hosted OpenAI surface is the right call when consolidation, GPT-4o-class accuracy on noisy audio, or zero ops matter more than raw price. If sub-second streaming + voice-agent loop is the product, Deepgram Nova-3 is closer to purpose-built. For self-hosted control on the same Whisper weights, see openai-whisper (reference) or faster-whisper (production swap-in). For a managed transcript with no API wiring, drop a file into Whipscribe.

Quickstart · pick a runtime

Three minimal calls to /v1/audio/transcriptions with gpt-4o-transcribe. Export your key as OPENAI_API_KEY first — get one from the OpenAI dashboard. Never hard-code the key in source.

1. Python SDK · openai-python v2.x

Official openai Python SDK · transcribe a local file with gpt-4o-transcribe.

# pip install --upgrade openai
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        # response_format="verbose_json",               # whisper-1 only
        # timestamp_granularities=["segment", "word"],  # whisper-1 only; needs verbose_json
        # language="en",                                # ISO-639-1 hint
    )

print(resp.text)

Source: openai/openai-python ↗ · API reference: platform.openai.com/docs/api-reference/audio/createTranscription ↗

2. Node SDK · openai-node v4+

Official openai Node / TypeScript SDK · same call from Node 18+.

// npm install openai
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const resp = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: fs.createReadStream("audio.mp3"),
  // response_format: "verbose_json",               // whisper-1 only
  // timestamp_granularities: ["segment", "word"],  // whisper-1 only; needs verbose_json
  // language: "en",                                // ISO-639-1 hint
});

console.log(resp.text);

Source: openai/openai-node ↗

3. cURL · multipart/form-data

Plain HTTPS POST to /v1/audio/transcriptions · works from shell, CI, and edge runtimes.

# Bearer auth via $OPENAI_API_KEY · multipart upload
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="gpt-4o-transcribe"

# Subtitle output (whisper-1): switch to -F model="whisper-1" and add -F response_format="srt"
# Word timestamps (whisper-1):
#   -F response_format="verbose_json" \
#   -F "timestamp_granularities[]=word" -F "timestamp_granularities[]=segment"

Source: platform.openai.com/docs/api-reference/audio/createTranscription ↗

Features

Speaker diarization	No
Word-level timestamps	Yes
Streaming / real-time	No
Languages supported	99
HIPAA eligible	No