IARPA Babel

by IARPA / LDC

Low-resource multilingual corpora for automatic speech recognition (ASR) and keyword search (KWS): 25+ languages of telephone speech.

TL;DR

Best for low-resource ASR + KWS evaluation across 25+ languages. Pricing: paid.

Category: Open source
License: LDC license
Pricing: paid
Platforms: LDC

What it is

The IARPA Babel program ran from 2011 to 2016 and produced 25+ telephone-speech corpora for low-resource languages, including Cantonese, Pashto, Tagalog, Turkish, Vietnamese, Lao, Zulu, and Tamil. The corpora are distributed through the LDC under a paid license.

Best for: Low-resource ASR + KWS evaluation across 25+ languages.
Watch out for: LDC license · paid · US-government-funded · 8 kHz telephony audio · amount of data varies by language. Cite: Harper, 2013.

Install / use

https://catalog.ldc.upenn.edu/search?searchString=Babel  # LDC paid
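
For the KWS evaluation use case, one commonly used score is term-weighted value (TWV), the metric from NIST's OpenKWS evaluations. The sketch below is a simplified illustration, not an official scorer: the `KeywordCounts` class, the function name, and the input counts are hypothetical, and matching detections to reference occurrences is assumed to have been done already.

```python
from dataclasses import dataclass

BETA = 999.9  # false-alarm weight used in OpenKWS-style scoring


@dataclass
class KeywordCounts:
    n_true: int     # reference occurrences of the keyword
    n_correct: int  # detections that match a reference occurrence
    n_false: int    # detections with no matching reference


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """Return 1 - mean(P_miss + beta * P_fa) over keywords with n_true > 0."""
    scored = [c for c in counts if c.n_true > 0]
    if not scored:
        raise ValueError("no keyword has reference occurrences")
    total = 0.0
    for c in scored:
        non_target_trials = speech_seconds - c.n_true
        if non_target_trials <= 0:
            raise ValueError("speech_seconds must exceed n_true")
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = c.n_false / non_target_trials
        total += p_miss + beta * p_fa
    return 1.0 - total / len(scored)
```

A perfect system scores 1.0 and a system that returns nothing scores 0.0. Because of the large false-alarm weight, even a few spurious detections can push the value below zero.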

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 25
HIPAA eligible: No

IARPA Babel vs Whipscribe

Feature | IARPA Babel | Whipscribe
Category | Open source | Transcription APIs
Pricing | paid | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 25 | 99
Platforms | LDC | Web, API, MCP

Alternatives to IARPA Babel

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.