VoxTube

by ID R&D

A 5,000-hour, weakly labeled multilingual TTS corpus collected from YouTube, covering 50 languages.

TL;DR

Best for speaker verification and multilingual TTS pretraining across 50 languages. Pricing: research-only.

Category: Open source
Pricing: research-only
Platforms: HuggingFace

What it is

VoxTube is a 5,000-hour, weakly labeled multilingual corpus collected from YouTube, covering 50 languages. It is designed for speaker-verification and TTS pretraining at scale and is released under a research-only license.

Best for: Speaker verification + multilingual TTS pretraining across 50 languages.
Watch out for: custom non-commercial license · YouTube TOS · weakly labeled · ~5,000 hours. Cite: Yakovlev et al., Interspeech 2023.

Install / use

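# Requires the Hugging Face `datasets` package (pip install datasets)
from datasets import load_dataset

# Load the corpus from the Hugging Face Hub
ds = load_dataset('idrnd/VoxTube')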
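Because the corpus is weakly labeled, it can help to tally labels over a small sample before committing to a full download. The following is a minimal sketch, not the dataset's documented workflow: the helper `count_by_field` and the record keys `speaker_id` and `language` are hypothetical, and the toy rows stand in for real dataset rows.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Mapping


def count_by_field(records: Iterable[Mapping], field: str, limit: int = 1000) -> Dict[str, int]:
    """Count the values of `field` over at most `limit` records."""
    counts: Counter = Counter()
    for record in islice(records, limit):
        # Rows without the field are grouped under "unknown".
        counts[record.get(field, "unknown")] += 1
    return dict(counts)


# Toy rows standing in for dataset records; the key names are assumptions.
toy_rows = [
    {"speaker_id": "spk1", "language": "en"},
    {"speaker_id": "spk1", "language": "en"},
    {"speaker_id": "spk2", "language": "ru"},
    {"speaker_id": "spk3"},
]

language_counts = count_by_field(toy_rows, "language")
print(language_counts)
```

To try this on the real corpus, one option is to pass a streamed dataset, for example `load_dataset('idrnd/VoxTube', split='train', streaming=True)`, as `records`. The `train` split name and the column names are unverified assumptions, so check the dataset card first.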

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 50
HIPAA eligible: No

VoxTube vs Whipscribe

Feature              VoxTube         Whipscribe
Category             Open source     Transcription APIs
Pricing              research-only   free beta
Speaker diarization  No              Yes
Word timestamps      No              Yes
Streaming            No              No
Languages            50              99
Platforms            HuggingFace     Web, API, MCP

Alternatives to VoxTube

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.