The People's Speech

by MLCommons

A 30,000-hour, CC-BY-licensed English ASR corpus sourced from the Internet Archive.

TL;DR

Best for commercial-friendly English ASR training (CC-BY allows commercial use) at scale. Pricing: free.

Category: Open source
Pricing: free
Platforms: HuggingFace

What it is

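MLCommons' People's Speech is a roughly 30,000-hour English ASR corpus released under commercially permissive Creative Commons licenses (CC-BY and CC-BY-SA), and one of the largest labeled English speech corpora available on such terms. It has two main subsets: about 12,000 hours of clean audio and about 18,000 hours of dirty (noisier) audio.

Best for: Commercial-friendly English ASR training (CC-BY allows commercial use) at scale.
Watch out for: license varies by subset (CC-BY vs. CC-BY-SA 4.0; share-alike terms apply to the SA portions), so check the dataset card for the subset you load · Internet Archive crawl · long-form audio with alignment noise · ~20% transcripts via forced alignment. Cite: Galvez et al., NeurIPS 2021.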
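
One common way to limit the impact of long-form audio and alignment noise is to keep only utterances within a duration window before training. The sketch below is illustrative only: it assumes each record carries a `duration_ms` field (check the dataset card for the actual schema), and the function names and thresholds are hypothetical choices, not part of the dataset or of any library.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_duration(
    examples: Iterable[Dict[str, Any]],
    min_ms: int = 1_000,
    max_ms: int = 30_000,
) -> Iterator[Dict[str, Any]]:
    """Yield only records whose duration falls inside [min_ms, max_ms].

    Assumes every record has an integer 'duration_ms' field (an assumption
    about the schema; verify it on the dataset card).
    """
    for example in examples:
        if min_ms <= example["duration_ms"] <= max_ms:
            yield example


def total_hours(examples: Iterable[Dict[str, Any]]) -> float:
    """Sum the 'duration_ms' field over all records and convert to hours."""
    total_ms = sum(example["duration_ms"] for example in examples)
    return total_ms / 3_600_000.0
```

Both helpers accept any iterable of dictionaries, so they can be applied to a small sample first to see how many hours survive a given window before committing to it.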

Install / use

from datasets import load_dataset
ds = load_dataset('MLCommons/peoples_speech', 'clean')
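
At this scale, downloading a whole subset before looking at it can be impractical. A minimal sketch, assuming the `datasets` library's streaming mode and a `train` split for the `clean` configuration (the split name is an assumption; check the dataset card). The helper names `stream_clean_subset` and `take_examples` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_examples(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable of examples."""
    return list(islice(examples, n))


def stream_clean_subset():
    """Open the 'clean' subset lazily instead of downloading it in full.

    Requires the third-party 'datasets' package and network access.
    The split name 'train' is an assumption.
    """
    from datasets import load_dataset

    return load_dataset(
        "MLCommons/peoples_speech",
        "clean",
        split="train",
        streaming=True,  # iterate over records as they are fetched
    )
```

Calling `take_examples(stream_clean_subset(), 5)` would then fetch only the first five records, which is usually enough to inspect the fields before planning a full download.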

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 1
HIPAA eligible: No

The People's Speech vs Whipscribe

Feature | The People's Speech | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 1 | 99
Platforms | HuggingFace | Web, API, MCP

Alternatives to The People's Speech

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, you can paste a URL or upload a file and get a transcript in seconds.