Spotify Podcast Dataset (100K)

by Spotify Research

Roughly 100k English podcast episodes with metadata, used as the TREC Podcasts track evaluation corpus.

TL;DR

Best for long-form podcast summarization and retrieval research (TREC Podcasts track, 2020–2021). Pricing: research-only.

Category: Open source
License: Spotify research license (non-commercial)
Pricing: research-only
Platforms: Web

What it is

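The Spotify Podcast Dataset is a corpus of podcast audio with episode metadata and machine-generated transcripts: roughly 100k English episodes plus about 70k hours of Portuguese. It served as the evaluation corpus for the TREC Podcasts track and is available for research use only.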

Best for: Long-form podcast summarization and retrieval research (TREC Podcasts track, 2020–2021).
Watch out for: Access goes through a request form and is granted under the Spotify research license, which is non-commercial only. Size is roughly 100k English episodes plus about 70k hours of Portuguese. Cite: Clifton et al., COLING 2020.

Install / use

https://podcastsdataset.byspotify.com/  # request access
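
Once access is granted, a first step is usually to pair episode metadata with transcript text. The sketch below is a minimal illustration, not the dataset's documented loader. It assumes a tab-separated metadata file and per-episode JSON transcripts nested as `results` / `alternatives` / `transcript`. The column names (`episode_name`, `duration`), the helper names (`load_metadata`, `transcript_text`), and the sample data are all hypothetical, so check them against the files you actually receive.

```python
import csv
import io
import json


def load_metadata(tsv_text):
    """Parse tab-separated episode metadata into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def transcript_text(transcript):
    """Join the first alternative's text from each result chunk."""
    parts = []
    for result in transcript.get("results", []):
        alternatives = result.get("alternatives", [])
        # Some chunks may carry no text at all; skip them.
        if alternatives and alternatives[0].get("transcript"):
            parts.append(alternatives[0]["transcript"].strip())
    return " ".join(parts)


# Tiny made-up sample standing in for the real files (assumed layout).
sample_tsv = "episode_name\tduration\nPilot\t31.5\n"
sample_json = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "Hello and welcome. "}]},
        {"alternatives": [{}]},
        {"alternatives": [{"transcript": "Today we talk about podcasts."}]},
    ]
})

episodes = load_metadata(sample_tsv)
text = transcript_text(json.loads(sample_json))
print(episodes[0]["episode_name"], "|", text)
```

With real files, read the metadata and transcript contents from disk and pass them to the same two helpers. Keeping the parsing separate from file access makes it easy to adjust if the actual layout differs from what is assumed here.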

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 2
HIPAA eligible: No

Spotify Podcast Dataset (100K) vs Whipscribe

Feature | Spotify Podcast Dataset (100K) | Whipscribe
Category | Open source | Transcription APIs
Pricing | research-only | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 2 | 99
Platforms | Web | Web, API, MCP

Alternatives to Spotify Podcast Dataset (100K)

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, you can submit a URL or upload a file and get a transcript back.