Russian Open STT

by Silero

20,000-hour Russian ASR corpus: the largest open Russian-language speech dataset.

TL;DR

20,000-hour Russian ASR corpus: the largest open Russian-language speech dataset.

Best for Russian ASR training and evaluation at scale. Pricing: free.

Category: Open source
License: CC BY-NC 4.0
Pricing: free
Platforms: GitHub

What it is

Open STT (Silero) is a 20,000-hour Russian ASR corpus assembled from YouTube, audiobooks, public speech, and radio. License: CC BY-NC 4.0.

Best for: Russian ASR training and evaluation at scale.
Watch out for: the CC BY-NC 4.0 license restricts use to non-commercial purposes, and the corpus is split into multiple subsets (audiobooks, YouTube, public speech). Cite: Veysov, 2019.

Install / use

git clone https://github.com/snakers4/open_stt
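Because the corpus is split into several subsets, it can help to check how many hours each one contributes before training. The following is a minimal sketch, not the project's own tooling. It assumes a CSV manifest with hypothetical `wav_path` and `duration` (seconds) columns, and assumes the first path component names the subset. Check the repository's documentation for the actual manifest layout and adjust the column names.

```python
import csv
import io
from collections import defaultdict


def hours_by_subset(manifest_file):
    """Sum clip durations per subset and return hours per subset.

    Assumes (hypothetically) a CSV with 'wav_path' and 'duration' columns,
    where duration is in seconds and the first path component is the subset.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(manifest_file):
        subset = row["wav_path"].split("/")[0]
        totals[subset] += float(row["duration"])
    # Convert accumulated seconds to hours
    return {name: seconds / 3600.0 for name, seconds in totals.items()}


# Tiny in-memory manifest standing in for a real file
sample = io.StringIO(
    "wav_path,text_path,duration\n"
    "audiobooks/a/1.wav,audiobooks/a/1.txt,1800\n"
    "audiobooks/a/2.wav,audiobooks/a/2.txt,1800\n"
    "radio/b/1.wav,radio/b/1.txt,900\n"
)
result = hours_by_subset(sample)
print(result)
```

For a real manifest, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.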

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 1
HIPAA eligible: No

Russian Open STT vs Whipscribe

Feature | Russian Open STT | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 1 | 99
Platforms | GitHub | Web, API, MCP

Alternatives to Russian Open STT

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file and you'll have a transcript in seconds.