Transcription APIs

OpenAI Whisper API

OpenAI

Hosted Whisper (whisper-1) from OpenAI at $0.006 per minute.

$0.006/min

AssemblyAI

Universal-2 model + diarization, PII redaction, topic detection, summarization.

from $0.37/hr

Deepgram

Nova-2 model, excellent streaming, strong at conversational audio.

from $0.0043/min

Rev.ai

The API spin-off of Rev — strong English accuracy, topic detection, custom vocab.

from $0.02/min

Gladia

Whisper-based API with diarization, 99-language coverage, pay-per-minute.

from $0.0102/min

Speechmatics

Enterprise ASR with strong accents and on-prem deployment options.

contact sales

Whipscribe

Neugence

Hosted faster-whisper + whisperX with paste-a-URL, batch, and MCP access.

This is us

Amazon Transcribe

Amazon Web Services

AWS managed speech-to-text with batch + streaming, custom vocabulary, and medical/call-analytics variants.

from $0.0240/min (standard batch, tiered)

Amazon Transcribe Medical

Amazon Web Services

HIPAA-eligible medical-specialty ASR from AWS for clinical conversations and dictation.

from $0.075/min

Azure AI Speech (Speech-to-Text)

Microsoft Azure

Microsoft Azure's managed STT with batch, real-time, custom speech, and conversation transcription.

from $1/hr (standard) and $0.30/hr (batch transcription)

Google Cloud Speech-to-Text

Google Cloud

GCP Speech v2 with Chirp 2 foundation model, batch + streaming, 125+ language variants.

from $0.016/min (v2 standard) / Chirp tiered

Google Chirp / Chirp 2

Google Cloud

Google's universal speech foundation model exposed via Speech-to-Text v2.

per Speech-to-Text v2 pricing (region-tiered)

IBM Watson Speech to Text

IBM

IBM Cloud's managed ASR with on-prem option, custom acoustic + language models.

Lite (free, capped) + Plus tier (~$0.01/min, tiered)

Oracle Cloud AI Speech

Oracle Cloud Infrastructure

OCI managed speech-to-text with batch + real-time and Whisper-based models.

tiered per-minute (see OCI pricing page)

Alibaba Cloud Intelligent Speech Interaction

Alibaba Cloud

Alibaba's managed Chinese-first ASR with batch + real-time and customizable hotwords.

tiered RMB-per-second pricing

Tencent Cloud ASR

Tencent Cloud

Tencent's managed Chinese-first ASR with one-sentence, real-time, and recording-file modes.

tiered RMB-per-second pricing

Baidu Speech

Baidu AI Cloud

Baidu AI Cloud's Chinese-first speech recognition family.

free tier + tiered RMB-per-call

Yandex SpeechKit

Yandex Cloud

Yandex Cloud's managed Russian-first STT + TTS with batch and streaming.

tiered per-second (RUB-denominated)

Sber SaluteSpeech

Sber (Salute)

Sber's Russian-language speech recognition + synthesis platform.

tiered RUB-per-second (see SmartMarket pricing)

Huawei Cloud Speech Interaction Service

Huawei Cloud

Huawei Cloud's managed ASR + TTS with one-sentence, real-time, and long-audio modes.

tiered per-call (China + international regions)

iFlyTek Open Platform Speech

iFlyTek

iFlyTek's market-leading Mandarin ASR family for enterprise and education.

tiered per-day call quotas (RMB)

Volcengine Speech (ByteDance)

Volcengine (ByteDance)

ByteDance's Volcengine speech-to-text platform powering Douyin/CapCut workflows.

tiered per-second (RMB)

Naver Clova Speech

Naver Cloud Platform

Naver Cloud's Korean-first ASR with batch + real-time and speaker diarization.

tiered KRW-per-second

Kakao Speech (Kakao i)

Kakao Enterprise

Kakao Enterprise's Korean speech recognition + synthesis platform.

contact sales

NTT Communications COTOHA Voice

NTT Communications

NTT Com's Japanese-first STT under the COTOHA AI platform.

tiered JPY-denominated

AISpeech (Sipeed/iflyOS-class)

AISpeech

Chinese embedded ASR specialist for IoT devices and on-device speech.

contact sales

Soniox

Real-time multilingual ASR API with low-latency streaming and code-switching support.

per-minute (see Soniox pricing page)

ElevenLabs Scribe

ElevenLabs

ElevenLabs' speech-to-text API as a counterpart to its TTS, multilingual, word-timestamped.

from $0.40/hr (see ElevenLabs pricing page)

Sieve

Video-AI workflow platform with Whisper-based transcription endpoints.

per-second compute (see Sieve pricing)

Replicate (Whisper hosts)

Replicate

Replicate's catalog of community-hosted Whisper variants behind one API.

per-second GPU compute

Modal (ASR endpoints)

Modal Labs

Modal's serverless GPU platform commonly used to host Whisper / faster-whisper as an API.

per-second GPU compute (see Modal pricing)

RunPod (Whisper endpoints)

RunPod

RunPod's GPU cloud commonly used to deploy Whisper / faster-whisper as a serverless endpoint.

per-second GPU compute

fal.ai (Whisper / wizper endpoints)

fal.ai

fal.ai's hosted Whisper-family endpoints — low-latency, pay-per-second.

per-second compute (see fal.ai pricing)

Groq (Whisper endpoints)

Groq

Groq's LPU-based Whisper-large-v3 endpoint — exceptionally low-latency transcription.

from $0.111/hr (whisper-large-v3, batch)

OpenAI Realtime API (STT)

OpenAI

OpenAI's Realtime API streaming speech-in (whisper-1 / gpt-4o-transcribe family).

per-minute audio in (model-dependent)

Vatis Tech

Romanian-headquartered transcription API with strong CEE language coverage.

free tier + paid hours (see Vatis Tech pricing)

Wit.ai (Meta)

Meta

Meta's free natural-language and speech understanding platform.

free

Vonage Voice API (ASR Connector)

Vonage

Vonage's CPaaS speech-to-text via the ASR connector (typically Deepgram-powered).

Vonage Voice price + ASR per-minute

Plivo Voice (Speech Recognition)

Plivo

Plivo's CPaaS speech recognition for IVR + call-recording workflows.

Plivo Voice + per-minute ASR

Bandwidth Voice (Transcription)

Bandwidth

Bandwidth's voice CPaaS with optional transcription on recordings and IVR.

per-minute (see Bandwidth pricing)

Play.HT (STT endpoints)

Play.HT

Play.HT's transcription endpoint as a counterpart to its TTS family.

per-minute (see Play.HT pricing)

Lemonfox.ai

Hosted Whisper API at low per-hour pricing for developers.

from $0.17/hr (Whisper)

Speakbot (Whisper API alt)

Speakbot

Hosted Whisper API with file-based and URL ingestion.

per-minute (see Speakbot pricing)

Voicegain

Deep-learning ASR you can deploy in your own cloud or use as managed SaaS.

per-minute SaaS + Edge license

Amazon Transcribe Streaming

Amazon Web Services

Real-time streaming variant of Amazon Transcribe over HTTP/2 + WebSocket.

from $0.024/min

Azure Fast Transcription

Microsoft Azure

Azure Speech's batch-fast mode for short-turnaround transcription with predictable latency.

per-Azure-Speech pricing (Fast variant)

Google Cloud Speaker Diarization

Google Cloud

Diarization layer for Google Cloud Speech-to-Text v2.

per-Speech v2 pricing

Deepgram Nova-3

Deepgram

Deepgram's current-generation streaming + batch ASR model.

from $0.0043/min (Nova-3, batch)

AssemblyAI Realtime / Streaming

AssemblyAI

AssemblyAI's WebSocket streaming endpoint for live captions and agents.

from $0.15/hr (Streaming)

Gladia Realtime

Gladia

Gladia's real-time streaming ASR API with multilingual code-switching.

per-hour streaming (see Gladia pricing)

OpenAI /audio/transcriptions (whisper-1, gpt-4o-transcribe)

OpenAI

OpenAI's hosted Whisper + gpt-4o-transcribe models, batch endpoint.

from $0.006/min (whisper-1)

OpenAI /audio/translations

OpenAI

OpenAI's translate-to-English audio endpoint.

from $0.006/min (whisper-1)

SambaNova (Whisper endpoints)

SambaNova

SambaNova's hosted Whisper-large-v3 endpoint on its RDU accelerator.

see SambaNova pricing

Together AI (Whisper)

Together AI

Together AI's hosted Whisper models among its open-model catalog.

per-Together-AI pricing

DeepInfra (Whisper)

DeepInfra

DeepInfra's hosted Whisper endpoint with per-second GPU pricing.

per-DeepInfra pricing

OVHcloud AI Speech-to-Text

OVHcloud

OVHcloud's managed speech-to-text inside its sovereign EU cloud.

per-OVHcloud pricing

Scaleway AI Inference

Scaleway

Scaleway's GPU inference platform commonly used for hosted Whisper.

per-Scaleway pricing (compute-based)

Alibaba Tongyi Audio (Qwen-Audio)

Alibaba Cloud

Alibaba's Tongyi multimodal model exposed for transcription + audio understanding.

tiered RMB-per-token / per-second

Baidu ERNIE Speech

Baidu

Baidu's ERNIE-aligned speech models inside ERNIE Bot Cloud.

tiered RMB-per-call

Huawei Pangu Speech

Huawei Cloud

Huawei's Pangu foundation models extended to speech for enterprise scenarios.

tiered RMB-per-call

Tencent Hunyuan (audio modality)

Tencent Cloud

Tencent's Hunyuan multimodal model with audio understanding endpoints.

tiered RMB-per-token

Naver HyperCLOVA X (audio)

Naver Cloud Platform

Naver's HyperCLOVA X foundation model with audio understanding.

Naver Cloud HyperCLOVA pricing

Kakao Kanana Speech

Kakao Enterprise

Kakao's Kanana foundation-model family with audio understanding.

contact sales

Rev.ai Streaming

Rev

Rev.ai's WebSocket streaming endpoint for live transcripts.

from $0.035/min (Streaming)

Speechmatics (batch / language packs)

Speechmatics

Speechmatics batch ASR with broad language pack catalog.

from $1.04/hr (Standard)

Hume EVI

Hume AI Inc.

Empathic voice interface with emotional-tone awareness.

pay-as-you-go

Otter.ai API

Otter.ai Inc.

Developer API access to Otter.ai's transcription engine.

contact sales

Rasa Pro

Rasa Technologies GmbH

Open-source-anchored conversational AI for enterprise.

free / contact sales

Google Cloud Dialogflow

Google Cloud

Google's conversational-AI platform for voice and chat agents.

usage-based

Amazon Lex

Amazon Web Services

AWS conversational-AI platform for voice and text bots.

usage-based

Microsoft Bot Framework

Microsoft Corporation

Microsoft's open-source SDK and platform for conversational bots.

free SDK + Azure costs

Rev VoiceHub

Rev

Rev's enterprise transcription and recording API platform.

paid

Trint API

Trint

Trint's transcription and translation API for newsrooms and media teams.

paid

iFlyTek Open Platform

iFlyTek

China's largest speech AI vendor — Mandarin, dialects, and 60+ languages via developer APIs.

tiered · free quota + pay-as-you-go in CNY

Tencent Cloud ASR

Tencent Cloud

Tencent's cloud speech-to-text with one-sentence, recording-file, and real-time APIs.

tiered · per-second pricing in CNY

Alibaba DAMO ASR

Alibaba Cloud

Alibaba Cloud / DAMO Academy speech recognition with Paraformer non-autoregressive models.

tiered · per-hour pricing in CNY/USD

Volcengine Speech

ByteDance Volcano Engine

ByteDance's Volcano Engine speech-to-text — short, long, and streaming Mandarin ASR.

tiered · pay-as-you-go in CNY

Mobvoi Speech

Mobvoi

Mobvoi (Chumen Wenwen) speech APIs — Mandarin recognition behind TicWatch and Volkswagen voice.

enterprise · contact sales

NetEase Youdao ASR

NetEase Youdao

Youdao Cloud speech-to-text — Mandarin recognition behind Youdao Translator and dictionary pen.

tiered · per-character pricing in CNY

Sogou ASR

Sogou (Tencent)

Sogou (Tencent-owned) speech-to-text — input-method-grade Mandarin recognition.

enterprise · contact sales

Reverie Language Tech

Reverie Language Technologies

Reverie's Indic speech recognition — 11 Indian languages from one of Reliance Jio's group companies.

enterprise · contact sales

Bhashini

Government of India (Digital India Bhashini Division)

Government of India's national language platform — public ASR APIs for 22 official languages.

free for non-commercial; commercial tiers TBD

Sarvam AI

Sarvam AI — full-stack Indian foundation models including Saaras / Saaransh speech APIs.

tiered · per-minute pricing with free tier

Tinkoff VoiceKit

Tinkoff (T-Bank)

Tinkoff VoiceKit — Russian-language ASR + TTS used inside Tinkoff Bank's contact centre.

tiered · per-second pricing in RUB

SoundHound

SoundHound AI

SoundHound Houndify — multilingual voice AI platform with embedded and cloud ASR.

tiered · contact sales for enterprise pricing

Lelapa AI

Lelapa AI — South African startup building Vulavula speech and language tools for African languages.

tiered · pay-as-you-go with free tier

Intella

Intella — Arabic speech-to-text API focused on MSA and major Arabic dialects.

tiered · per-hour pricing in USD with free tier

Alvenir Danish ASR

Alvenir

Alvenir — Danish-language speech-to-text product from a Copenhagen startup.

tiered · pay-as-you-go in EUR/DKK

AI-Loop

AI-Loop — multilingual African-language speech and NLP infrastructure.

tiered · pay-as-you-go in USD

Hume EVI

Hume AI

Empathic Voice Interface — voice AI that reads and responds to emotion in speech.

see vendor pricing

Dialogflow CX

Google Cloud

Google Cloud's enterprise conversational AI platform with voice and chat channels.

see vendor pricing

Microsoft Bot Framework (Voice)

Microsoft

Microsoft's bot orchestration SDK with voice channels via Direct Line Speech.

see vendor pricing

IBM watsonx Assistant (Voice)

IBM

IBM's enterprise conversational AI platform with voice and contact-center integrations.

see vendor pricing

Vertex AI Conversation

Google Cloud

Google Cloud's LLM-native conversational AI builder with voice support.

see vendor pricing

Twilio Voice Intelligence + Agents

Twilio

Twilio's ASR, voice intelligence, and ConversationRelay primitives for voice agents.

see vendor pricing

Plivo AI

Plivo

Voice AI agent capability layered on Plivo's CPaaS voice network.

see vendor pricing

Bandwidth Voice AI

Bandwidth

AI voice tooling layered on Bandwidth's tier-1 U.S. carrier network.

see vendor pricing

Telnyx Voice AI

Telnyx

AI inference and voice agents on Telnyx's own carrier and GPU stack.

see vendor pricing

Voximplant

CPaaS with serverless VoxEngine scenarios and AI voice integrations.

see vendor pricing

Daily.co Voice

Daily.co

WebRTC infrastructure for realtime voice and video AI agents.

see vendor pricing

Deepgram Voice Agent API

Deepgram

Single API for low-latency voice agents bundling Deepgram ASR + LLM + TTS.

see vendor pricing

AssemblyAI LeMUR Voice

AssemblyAI

AssemblyAI's LLM framework over its ASR for voice intelligence and agents.

see vendor pricing

ElevenLabs Conversational AI

ElevenLabs

ElevenLabs' end-to-end voice agent API with ASR, LLM, and premium TTS.

see vendor pricing

Azure AI Speech Voice Agent

Microsoft

Microsoft Azure's bundle of Speech SDK + Bot Framework for voice agents.

see vendor pricing

Ultravox Agents

Fixie.ai

Speech-native LLM and hosted agent runtime by Fixie.ai.

see vendor pricing

OpenAI Realtime Agents SDK

OpenAI

OpenAI's Agents SDK pattern over the Realtime API for voice-native assistants.

see vendor pricing

Anthropic Voice Agent Patterns

Anthropic

Reference patterns for building voice agents with Anthropic Claude models.

see vendor pricing

Cartesia Voice Agent stack

Cartesia

Cartesia's Sonic TTS plus partner ASR/LLM for low-latency voice agents.

see vendor pricing

Pollyo

AI-dubbing API for video platforms — backend OEM rather than a creator-facing app.

paid

Camb.ai TTS

Camb.ai

Camb.ai's standalone text-to-speech surface — same MARS model that powers their dubbing.

paid

iSpeech

TTS + STT API with consumer text-reader apps.

freemium — API per-request, consumer apps free
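
The listings above quote prices in mixed units (some per minute, some per hour), which makes them hard to compare at a glance. The sketch below converts a handful of the listed "from" prices to USD per hour of audio and sorts them. It is only an illustration: the helper names (`usd_per_hour`, `rank_by_hourly_cost`) are made up for this example, the figures are copied from the entries above, and entries with tiered, non-USD, or compute-based pricing are left out because they cannot be normalized without more information.

```python
import re

# Prices copied verbatim from the listings above. Only entries with a
# plain "$X/min" or "$X/hr" figure are included; these are "from" prices,
# so real bills may differ by tier, region, or feature add-ons.
LISTED_PRICES = {
    "OpenAI Whisper API": "$0.006/min",
    "AssemblyAI": "from $0.37/hr",
    "Deepgram": "from $0.0043/min",
    "Rev.ai": "from $0.02/min",
    "Gladia": "from $0.0102/min",
    "Groq (whisper-large-v3, batch)": "from $0.111/hr",
    "Lemonfox.ai": "from $0.17/hr",
    "ElevenLabs Scribe": "from $0.40/hr",
    "Speechmatics (Standard)": "from $1.04/hr",
}

# Matches a dollar amount followed by "/min" or "/hr".
PRICE_RE = re.compile(r"\$(\d+(?:\.\d+)?)/(min|hr)")


def usd_per_hour(price: str) -> float:
    """Convert a '$X/min' or '$X/hr' string to USD per hour of audio."""
    match = PRICE_RE.search(price)
    if match is None:
        raise ValueError(f"cannot parse price: {price!r}")
    amount = float(match.group(1))
    unit = match.group(2)
    return amount * 60 if unit == "min" else amount


def rank_by_hourly_cost(prices: dict) -> list:
    """Return (name, USD per hour) pairs sorted from cheapest to priciest."""
    hourly = [(name, round(usd_per_hour(p), 4)) for name, p in prices.items()]
    return sorted(hourly, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, cost in rank_by_hourly_cost(LISTED_PRICES):
        print(f"{name:32s} ${cost:.3f}/hr")
```

Normalizing to a per-hour figure only compares list prices; it says nothing about accuracy, latency, or which features (diarization, streaming, redaction) are included at that price.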