OpenAI Realtime API (STT)
OpenAI's Realtime API streaming speech-in (whisper-1 / gpt-4o-transcribe family).
Best for voice-agent products needing streaming STT tightly coupled with GPT-4o reasoning. Pricing: per-minute audio in (model-dependent).
What it is
OpenAI's Realtime API provides bidirectional streaming (voice in, voice out) against the gpt-4o-realtime models, alongside dedicated transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe). It is the canonical pick for voice-agent applications that need a very tight loop between STT, LLM, and TTS. Pricing is per minute of audio input and output. For batch transcription, see the separate /audio/transcriptions endpoint.
Feature flags from vendor docs: streaming. Directory tags: commercial-api, voice-agent. Last vendor-page check: 2026-05-12.
Watch out for: No native speaker diarization; cost rises rapidly with bidirectional audio.
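To make the bidirectional-cost caveat concrete, here is a minimal arithmetic sketch. The function name and the per-minute rates are hypothetical placeholders, not vendor prices; substitute the current figures from the vendor's pricing page.

```python
def estimate_session_cost(minutes_in: float, minutes_out: float,
                          rate_in: float, rate_out: float) -> float:
    """Estimate the cost of one session billed per minute of audio.

    rate_in / rate_out are per-minute prices supplied by the caller.
    They are assumptions for illustration, not published prices.
    """
    return minutes_in * rate_in + minutes_out * rate_out


# Placeholder rates, for illustration only.
stt_only = estimate_session_cost(10, 0, rate_in=0.5, rate_out=1.0)
two_way = estimate_session_cost(10, 10, rate_in=0.5, rate_out=1.0)
print(stt_only, two_way)
```

With any rates where output audio is billed, a two-way session costs more than a transcription-only session of the same length, which is the effect the caveat describes.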
Install / use
WebSocket: wss://api.openai.com/v1/realtime?model=gpt-4o-realtime
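As a rough sketch of what connecting involves, the snippet below builds the WebSocket URL shown above together with a bearer-token header, and wraps a chunk of raw audio in an `input_audio_buffer.append` event. The helper names `build_connection` and `audio_append_event` are hypothetical, the model name is copied from the URL above, and the exact headers and audio format required may differ by API version, so check the vendor docs before relying on it.

```python
import base64
import json
from urllib.parse import urlencode

REALTIME_ENDPOINT = "wss://api.openai.com/v1/realtime"


def build_connection(model: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and auth headers (hypothetical helper)."""
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': model})}"
    # Assumption: standard bearer-token auth; some API versions may
    # require additional headers.
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


def audio_append_event(audio_chunk: bytes) -> str:
    """Serialize one audio chunk as a JSON client event (base64 payload)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


url, headers = build_connection("gpt-4o-realtime", "sk-your-key")
print(url)
```

The returned URL and headers can be handed to any WebSocket client, and each string from `audio_append_event` would be sent as a text frame once the socket is open.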
Features
| Feature | Supported |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | No |
| Streaming / real-time | Yes |
| Languages supported | 99 |
| HIPAA eligible | No |
OpenAI Realtime API (STT) vs Whipscribe
| Feature | OpenAI Realtime API (STT) | Whipscribe |
|---|---|---|
| Category | Transcription APIs | Transcription APIs |
| Pricing | per-minute audio in (model-dependent) | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | No | Yes |
| Streaming | Yes | No |
| Languages | 99 | 99 |
| Platforms | API | Web, API, MCP |
Alternatives to OpenAI Realtime API (STT)
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.