Spoken Wikipedia Corpora

by University of Hamburg

Long-form Wikipedia audiobook recordings in English, German, and Dutch (~1000h).

TL;DR

Best for long-form ASR with article-level context; alignment research. Pricing: free.

Category: Open source
License: CC BY-SA
Pricing: free
Platforms: Web

What it is

Spoken Wikipedia Corpora (SWC) collects volunteer-read Wikipedia articles in English (~395h), German (~386h), and Dutch (~165h). The recordings are aligned to the article text at both the sentence and the word level. License: CC BY-SA.

Best for: Long-form ASR with article-level context; alignment research.
Watch out for: the CC BY-SA license covers both the Wikipedia text and the audio, so share-alike terms apply to derived work; the number of speakers varies per article. Cite: Köhn et al., LREC 2016.

Install / use

https://nats.gitlab.io/swc/
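
Once an article's alignment file is downloaded, the word-level timestamps can be read with a few lines of Python. The following is a minimal sketch, assuming the alignment is stored as XML in which each aligned word is an element carrying `start` and `end` attributes in milliseconds. The element and attribute names used here (`n`, `pronunciation`, `start`, `end`) and the inline sample are assumptions for illustration, not the documented schema. Check them against the files you actually download.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# Hypothetical sample: tag and attribute names are assumptions for
# illustration, not the documented SWC schema.
SAMPLE = """<article>
  <s>
    <t text="Hello"><n pronunciation="hello" start="0" end="420"/></t>
    <t text="world"><n pronunciation="world" start="420" end="900"/></t>
    <t text="."/>
  </s>
</article>"""


def iter_word_timings(
    xml_text: str, tag: str = "n"
) -> Iterator[Tuple[str, float, float]]:
    """Yield (word, start_seconds, end_seconds) for every aligned word."""
    root = ET.fromstring(xml_text)
    for node in root.iter(tag):
        start, end = node.get("start"), node.get("end")
        if start is None or end is None:
            continue  # skip words that carry no alignment
        # Assumed unit: milliseconds; convert to seconds.
        yield node.get("pronunciation", ""), int(start) / 1000.0, int(end) / 1000.0


def aligned_seconds(xml_text: str) -> float:
    """Total duration covered by aligned words, in seconds."""
    return sum(end - start for _, start, end in iter_word_timings(xml_text))


if __name__ == "__main__":
    for word, start, end in iter_word_timings(SAMPLE):
        print(f"{start:7.3f} {end:7.3f} {word}")
```

Summing `aligned_seconds` over all articles of a language gives a quick check of how much audio is actually usable for ASR training, since words without timestamps are skipped.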

Features

Speaker diarization: No
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 3
HIPAA eligible: No

Spoken Wikipedia Corpora vs Whipscribe

Feature | Spoken Wikipedia Corpora | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | Yes | Yes
Streaming | No | No
Languages | 3 | 99
Platforms | Web | Web, API, MCP

Alternatives to Spoken Wikipedia Corpora

Whipscribe is a managed faster-whisper + whisperX service, an option if you want transcripts without running your own infrastructure.