AI4Bharat IndicVoices

by AI4Bharat (IIT Madras)

A ~16k-hour Indic-language ASR corpus covering 22 Indian languages.

TL;DR

Best for Indic-language ASR training across 22 official Indian languages. Pricing: free.

Category: Open source
License: CC BY 4.0
Pricing: free
Platforms: HuggingFace

What it is

IndicVoices from AI4Bharat is ~16k hours of speech across 22 Indian languages — the canonical Indic ASR training corpus. License: CC BY 4.0.

Best for: Indic-language ASR training across 22 official Indian languages.
Watch out for: CC BY 4.0 requires attribution; the corpus mixes spontaneous and read speech; the amount of audio varies by language. Cite: Javed et al., ACL 2024.
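
Because the amount of audio differs between languages, sampling training examples in proportion to raw hours would favor the largest languages. One common mitigation, which neither this page nor the dataset prescribes, is to sample each language with probability proportional to its share of hours raised to an exponent alpha. The sketch below assumes you have already tallied hours per language yourself; `sampling_weights` is a hypothetical helper name, not part of any library.

```python
def sampling_weights(hours_by_language: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """Return per-language sampling probabilities proportional to share**alpha.

    alpha=1.0 keeps the natural distribution; alpha=0.0 samples languages uniformly.
    """
    if not hours_by_language:
        raise ValueError("hours_by_language is empty")
    if any(h <= 0 for h in hours_by_language.values()):
        raise ValueError("every language needs a positive number of hours")

    total = sum(hours_by_language.values())
    # Raise each language's share of the total hours to the power alpha.
    scaled = {lang: (h / total) ** alpha for lang, h in hours_by_language.items()}
    # Renormalize so the probabilities sum to 1.
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}
```

With made-up tallies of 900 and 100 hours for two languages and alpha=0.5, the larger language's sampling probability drops from 0.9 to 0.75. The right alpha depends on the task and should be checked against per-language evaluation results.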

Install / use

from datasets import load_dataset
ds = load_dataset('ai4bharat/IndicVoices', 'hindi')
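
For a corpus of this size, downloading a full language config up front may not be practical. The sketch below wraps the call above so that it streams examples lazily instead. It rests on several assumptions that this page does not confirm: that config names are lowercase language names (as 'hindi' suggests), that a split named "train" exists, and that your HuggingFace account has whatever access the dataset requires. `load_kwargs` and `stream_indicvoices` are hypothetical helper names.

```python
def load_kwargs(language: str, split: str = "train") -> dict:
    """Build keyword arguments for datasets.load_dataset.

    Assumption: config names are lowercase language names, like 'hindi'.
    """
    return {
        "path": "ai4bharat/IndicVoices",
        "name": language.strip().lower(),
        "split": split,      # assumed split name; check the dataset card
        "streaming": True,   # iterate lazily instead of downloading everything
    }


def stream_indicvoices(language: str, split: str = "train"):
    """Return a streaming dataset for one language (needs network access)."""
    # Imported lazily so load_kwargs can be used without the datasets package.
    from datasets import load_dataset
    return load_dataset(**load_kwargs(language, split))
```

Fetching one example with `next(iter(stream_indicvoices("hindi")))` and printing its keys is a quick way to inspect the schema before relying on any field name.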

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 22
HIPAA eligible: No

AI4Bharat IndicVoices vs Whipscribe

Feature | AI4Bharat IndicVoices | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 22 | 99
Platforms | HuggingFace | Web, API, MCP

Alternatives to AI4Bharat IndicVoices

Whipscribe is a managed faster-whisper + whisperX service, aimed at getting transcripts without running your own infrastructure.