Shrutilipi

by AI4Bharat (IIT Madras)

6457h Indic-language ASR corpus from All India Radio news broadcasts.

TL;DR

Best for Indic-language ASR pretraining from broadcast news with naturally aligned transcripts. Pricing: free.

Category: Open source
License: CC BY 4.0
Pricing: free
Platforms: HuggingFace

What it is

Shrutilipi is 6457 hours of All India Radio news broadcasts across 12 Indian languages, paired with the official AIR transcripts. License: CC BY 4.0.

Best for: Indic-language ASR pretraining from broadcast news with naturally aligned transcripts.
Watch out for: CC BY 4.0 attribution requirement · AIR-sourced audio · news-domain bias. Cite: Bhogale et al., 2023.

Install / use

from datasets import load_dataset
ds = load_dataset('ai4bharat/Shrutilipi')
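
At 6457 hours, the corpus is large enough that a full download may not be what you want for a first look. The following is a minimal sketch of lazy access using the `streaming=True` option of `load_dataset`. The helper names `peek` and `stream_shrutilipi` are hypothetical, the `train` split name is an assumption, and whether this repository needs a per-language configuration name or supports streaming at all is not confirmed by this page, so treat it as a starting point and check the dataset card.

```python
from itertools import islice

# Repo id taken from the install line above.
DATASET_ID = "ai4bharat/Shrutilipi"


def peek(examples, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_shrutilipi(split="train", **kwargs):
    """Open the corpus lazily (hypothetical helper).

    Assumptions: the split is called "train", and the repo can be
    streamed. Pass extra arguments (for example a configuration name)
    through kwargs if the dataset card calls for them.
    """
    # Imported here so that peek() stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True, **kwargs)
```

Calling `peek(stream_shrutilipi())` should then fetch only the first few records, which is a cheap way to inspect the available fields before committing to a full download.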

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 12
HIPAA eligible: No

Shrutilipi vs Whipscribe

Feature | Shrutilipi | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 12 | 99
Platforms | HuggingFace | Web, API, MCP

Alternatives to Shrutilipi

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, give it a URL or upload a file and you'll have a transcript in seconds.