Wall Street Journal (WSJ)

by LDC

80h of read newspaper sentences — foundational read-speech ASR corpus from 1992.

TL;DR

Best for historical ASR baselines + read-speech academic comparisons. Pricing: paid.

Category: Open source
Pricing: paid
Platforms: LDC

What it is

WSJ (LDC93S6A for WSJ0, LDC94S13A for WSJ1) is about 80 hours of read Wall Street Journal sentences, the corpus behind the early ARPA continuous speech recognition (CSR) benchmarks. It is still cited for legacy comparisons. Access requires a paid LDC license.

Best for: Historical ASR baselines + read-speech academic comparisons.
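Watch out for: paid LDC license · 16 kHz audio · WSJ0 + WSJ1 parts, listed as separate catalog entries · 5k vs 20k vocabulary evaluation conditions. Cite: Paul & Baker, "The Design for the Wall Street Journal-based CSR Corpus", HLT 1992.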

Install / use

https://catalog.ldc.upenn.edu/LDC93S6A   # WSJ0, LDC membership
https://catalog.ldc.upenn.edu/LDC94S13A  # WSJ1, LDC membership
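
The 16 kHz condition listed under "Watch out for" can be checked on a downloaded copy before any training run. The following is a minimal sketch, assuming the audio arrives as NIST SPHERE files with an ASCII header at the start of each file. The helper names `parse_sphere_header` and `is_expected_rate` are hypothetical and are not part of any LDC tooling. The sketch only reads the header; it does not decode samples, which may be compressed and need a separate converter.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    # The first two lines hold the magic string and the header size in bytes.
    preamble = raw[:16].decode("ascii", errors="replace").split("\n")
    if len(preamble) < 2 or preamble[0] != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(preamble[1].strip())

    fields: Dict[str, HeaderValue] = {}
    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        key, type_tag, value = parts
        if type_tag == "-i":      # integer field
            fields[key] = int(value)
        elif type_tag == "-r":    # real-valued field
            fields[key] = float(value)
        else:                     # "-sN": string field of length N
            fields[key] = value
    return fields


def is_expected_rate(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the file's header reports the expected sample rate."""
    # Assumption: the header fits in the first 1024 bytes, the usual size.
    with open(path, "rb") as handle:
        header = parse_sphere_header(handle.read(1024))
    return header.get("sample_rate") == expected_rate
```

Calling `is_expected_rate` on each audio file in a local copy gives a quick sanity check that nothing was resampled along the way; the path layout of that copy is left to the reader.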

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 1
HIPAA eligible: No

Wall Street Journal (WSJ) vs Whipscribe

Feature | Wall Street Journal (WSJ) | Whipscribe
Category | Open source | Transcription APIs
Pricing | paid | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 1 | 99
Platforms | LDC | Web, API, MCP

Alternatives to Wall Street Journal (WSJ)

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, submit a URL or upload a file and you'll have a transcript in seconds.