Transcription APIs
Hosted transcription endpoints you call with an API key — no infrastructure to manage.
111 tools · updated 2026-05-15
Hosted Whisper large-v3 from OpenAI at $0.006 per minute.
Universal-2 model + diarization, PII redaction, topic detection, summarization.
Nova-2 model, excellent streaming, strong at conversational audio.
The API spin-off of Rev — strong English accuracy, topic detection, custom vocab.
Whisper-based API with diarization, 99-language coverage, pay-per-minute.
Enterprise ASR with strong accent handling and on-prem deployment options.
Hosted faster-whisper + whisperX with paste-a-URL, batch, and MCP access.
AWS managed speech-to-text with batch + streaming, custom vocabulary, and medical/call-analytics variants.
HIPAA-eligible medical-specialty ASR from AWS for clinical conversations and dictation.
Microsoft Azure's managed STT with batch, real-time, custom speech, and conversation transcription.
GCP Speech v2 with Chirp 2 foundation model, batch + streaming, 125+ language variants.
Google's universal speech foundation model exposed via Speech-to-Text v2.
IBM Cloud's managed ASR with on-prem option, custom acoustic + language models.
OCI managed speech-to-text with batch + real-time and Whisper-based models.
Alibaba's managed Chinese-first ASR with batch + real-time and customizable hotwords.
Tencent's managed Chinese-first ASR with one-sentence, real-time, and recording-file modes.
Baidu AI Cloud's Chinese-first speech recognition family.
Yandex Cloud's managed Russian-first STT + TTS with batch and streaming.
Sber's Russian-language speech recognition + synthesis platform.
Huawei Cloud's managed ASR + TTS with one-sentence, real-time, and long-audio modes.
iFlyTek's market-leading Mandarin ASR family for enterprise and education.
ByteDance's Volcengine speech-to-text platform powering Douyin/CapCut workflows.
Naver Cloud's Korean-first ASR with batch + real-time and speaker diarization.
Kakao Enterprise's Korean speech recognition + synthesis platform.
NTT Com's Japanese-first STT under the COTOHA AI platform.
Chinese embedded ASR specialist for IoT devices and on-device speech.
Real-time multilingual ASR API with low-latency streaming and code-switching support.
ElevenLabs' speech-to-text API as a counterpart to its TTS, multilingual, word-timestamped.
Video-AI workflow platform with Whisper-based transcription endpoints.
Replicate's catalog of community-hosted Whisper variants behind one API.
Modal's serverless GPU platform commonly used to host Whisper / faster-whisper as an API.
RunPod's GPU cloud commonly used to deploy Whisper / faster-whisper as a serverless endpoint.
fal.ai's hosted Whisper-family endpoints — low-latency, pay-per-second.
Groq's LPU-based Whisper-large-v3 endpoint — exceptionally low-latency transcription.
OpenAI's Realtime API streaming speech-in (whisper-1 / gpt-4o-transcribe family).
Romanian-headquartered transcription API with strong CEE language coverage.
Meta's free natural-language and speech understanding platform.
Vonage's CPaaS speech-to-text via the ASR connector (typically Deepgram-powered).
Plivo's CPaaS speech recognition for IVR + call-recording workflows.
Bandwidth's voice CPaaS with optional transcription on recordings and IVR.
Play.HT's transcription endpoint as a counterpart to its TTS family.
Hosted Whisper API at low per-hour pricing for developers.
Hosted Whisper API with file-based and URL ingestion.
Deep-learning ASR you can deploy in your own cloud or use as managed SaaS.
Real-time streaming variant of Amazon Transcribe over HTTP/2 + WebSocket.
Azure Speech's batch-fast mode for short-turnaround transcription with predictable latency.
Diarization layer for Google Cloud Speech-to-Text v2.
Deepgram's current-generation streaming + batch ASR model.
AssemblyAI's WebSocket streaming endpoint for live captions and agents.
Gladia's real-time streaming ASR API with multilingual code-switching.
OpenAI's hosted Whisper + gpt-4o-transcribe models, batch endpoint.
OpenAI's translate-to-English audio endpoint.
SambaNova's hosted Whisper-large-v3 endpoint on its RDU accelerator.
Together AI's hosted Whisper models among its open-model catalog.
DeepInfra's hosted Whisper endpoint with per-second GPU pricing.
OVHcloud's managed speech-to-text inside its sovereign EU cloud.
Scaleway's GPU inference platform commonly used for hosted Whisper.
Alibaba's Tongyi multimodal model exposed for transcription + audio understanding.
Baidu's ERNIE-aligned speech models inside ERNIE Bot Cloud.
Huawei's Pangu foundation models extended to speech for enterprise scenarios.
Tencent's Hunyuan multimodal model with audio understanding endpoints.
Naver's HyperCLOVA X foundation model with audio understanding.
Kakao's Kanana foundation-model family with audio understanding.
Rev.ai's WebSocket streaming endpoint for live transcripts.
Speechmatics batch ASR with broad language pack catalog.
Empathic voice interface with emotional-tone awareness.
Developer API access to Otter.ai's transcription engine.
Open-source-anchored conversational AI for enterprise.
Google's conversational-AI platform for voice and chat agents.
AWS conversational-AI platform for voice and text bots.
Microsoft's open-source SDK and platform for conversational bots.
Rev's enterprise transcription and recording API platform.
Trint's transcription and translation API for newsrooms and media teams.
China's largest speech AI vendor — Mandarin, dialects, and 60+ languages via developer APIs.
Tencent's cloud speech-to-text with one-sentence, sentence, and real-time APIs.
Alibaba Cloud / DAMO Academy speech recognition with Paraformer non-autoregressive models.
ByteDance's Volcano Engine speech-to-text — short, long, and streaming Mandarin ASR.
Mobvoi (Chumen Wenwen) speech APIs — Mandarin recognition behind TicWatch and Volkswagen voice.
Youdao Cloud speech-to-text — Mandarin recognition behind Youdao Translator and dictionary pen.
Sogou (Tencent-owned) speech-to-text — input-method-grade Mandarin recognition.
Reverie's Indic speech recognition — 11 Indian languages from one of Reliance Jio's group companies.
Government of India's national language platform — public ASR APIs for 22 official languages.
Sarvam AI — full-stack Indian foundation models including Saaras / Saaransh speech APIs.
Tinkoff VoiceKit — Russian-language ASR + TTS used inside Tinkoff Bank's contact centre.
SoundHound Houndify — multilingual voice AI platform with embedded and cloud ASR.
Lelapa AI — South African startup building Vulavula speech and language tools for African languages.
Intella — Arabic speech-to-text API focused on MSA and major Arabic dialects.
Alvenir — Danish-language speech-to-text product from a Copenhagen startup.
AI-Loop — multilingual African-language speech and NLP infrastructure.
Empathic Voice Interface — voice AI that reads and responds to emotion in speech.
Google Cloud's enterprise conversational AI platform with voice and chat channels.
Microsoft's bot orchestration SDK with voice channels via Direct Line Speech.
IBM's enterprise conversational AI platform with voice and contact-center integrations.
Google Cloud's LLM-native conversational AI builder with voice support.
Twilio's ASR, voice intelligence, and ConversationRelay primitives for voice agents.
Voice AI agent capability layered on Plivo's CPaaS voice network.
AI voice tooling layered on Bandwidth's tier-1 U.S. carrier network.
AI inference and voice agents on Telnyx's own carrier and GPU stack.
CPaaS with serverless VoxEngine scenarios and AI voice integrations.
WebRTC infrastructure for realtime voice and video AI agents.
Single API for low-latency voice agents bundling Deepgram ASR + LLM + TTS.
AssemblyAI's LLM framework over its ASR for voice intelligence and agents.
ElevenLabs' end-to-end voice agent API with ASR, LLM, and premium TTS.
Microsoft Azure's bundle of Speech SDK + Bot Framework for voice agents.
Speech-native LLM and hosted agent runtime by Fixie.ai.
OpenAI's Agents SDK pattern over the Realtime API for voice-native assistants.
Reference patterns for building voice agents with Anthropic Claude models.
Cartesia's Sonic TTS plus partner ASR/LLM for low-latency voice agents.
AI-dubbing API for video platforms — backend OEM rather than a creator-facing app.
Camb.ai's standalone text-to-speech surface — same MARS model that powers their dubbing.
TTS + STT API with consumer text-reader apps.