YODAS

by CMU / WAVLab

~500k hours of YouTube speech across 100+ languages with CC-licensed subtitles.

TL;DR

Best for massive multilingual self-supervised pretraining and for ASR pretraining on low-resource languages. Pricing: research-only.

Category: Open source
License: CC BY 3.0 (subtitles)
Pricing: research-only
Platforms: HuggingFace

What it is

YODAS (YouTube-Oriented Dataset for Audio and Speech) provides ~500k hours of speech across 140+ languages with CC-licensed subtitles, harvested from YouTube. It is presented as the largest open multilingual speech corpus to date.

Best for: Massive multilingual self-supervised pretraining; ASR pretraining for low-resource languages.
Watch out for: The subtitles are released under CC BY 3.0, but the underlying YouTube videos are individually licensed; video access is subject to the YouTube Terms of Service; and quality varies widely from language to language. Cite: Li et al., ASRU 2023.
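Because quality varies so much between languages, it can help to sanity-check utterances before training on them. The sketch below is one possible heuristic, not part of YODAS itself: it assumes each record is a dict with a decoded `audio` entry (`array` plus `sampling_rate`) and a `text` transcript, and the function names and thresholds are hypothetical values chosen only for illustration.

```python
def duration_seconds(example):
    """Clip length in seconds, computed from a decoded audio dict."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def looks_usable(example, min_s=1.0, max_s=30.0, max_chars_per_s=25.0):
    """Rough filter: reject empty transcripts, odd durations, and text
    that is implausibly dense for the clip length.

    All thresholds are illustrative assumptions, not recommended values.
    """
    text = (example.get("text") or "").strip()
    if not text:
        return False
    dur = duration_seconds(example)
    if not (min_s <= dur <= max_s):
        return False
    # Very high characters-per-second often signals misaligned subtitles.
    return len(text) / dur <= max_chars_per_s
```

A characters-per-second cap is a crude proxy and behaves differently across scripts, so any real threshold would need tuning per language.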

Install / use

from datasets import load_dataset
ds = load_dataset('espnet/yodas', 'en000')  # 'en000' selects one English subset
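Downloading a full subset can be heavy for a corpus of this size, so streaming a few records first is often more practical. The following is a minimal sketch under stated assumptions: the Hugging Face `datasets` package is installed, the subset exposes a `train` split, and the helper names `take`, `stream_yodas`, and `preview` are hypothetical. Depending on the installed `datasets` version, loading may need extra options that are not shown here.

```python
from itertools import islice


def take(records, n):
    """Lazily yield at most n records from any iterable."""
    return islice(records, n)


def stream_yodas(config="en000"):
    """Return a streamed iterator over one YODAS subset.

    Assumes a 'train' split exists for the chosen subset.
    """
    from datasets import load_dataset  # lazy import; requires network access

    return load_dataset("espnet/yodas", config, split="train", streaming=True)


def preview(config="en000", n=3):
    """Print the field names of the first n records to inspect the schema."""
    for example in take(stream_yodas(config), n):
        print(sorted(example.keys()))
```

Calling `preview()` prints the keys of the first few records, which is a cheap way to confirm the actual schema before writing any processing code.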

Features

Speaker diarization: No
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 140
HIPAA eligible: No

YODAS vs Whipscribe

Feature              YODAS          Whipscribe
Category             Open source    Transcription APIs
Pricing              research-only  free beta
Speaker diarization  No             Yes
Word timestamps      Yes            Yes
Streaming            No             No
Languages            140            99
Platforms            HuggingFace    Web, API, MCP

Alternatives to YODAS

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, it accepts a URL or an uploaded file and returns a transcript in seconds.