Telugu Speech Corpus

by the Indian academic community

Open Telugu-language speech corpora and models for transcribing speech from south-eastern India.

TL;DR

Best for Telugu-language transcription in journalism, edtech, and civic-tech projects. Pricing: free.

Category: Open source
Pricing: free
Platforms: Linux

What it is

Telugu is one of India's largest languages by speaker count but remains underserved by international cloud speech-to-text (STT) services. Indian academic groups (notably at IIT Hyderabad and IIIT-H) have published Telugu speech corpora and fine-tuned models. Combined with AI4Bharat's IndicConformer-Telugu split, these are the practical foundation for production Telugu automatic speech recognition (ASR). They fit best for Telugu-language transcription in journalism, edtech, and civic-tech projects. The honest caveat: the releases are spread across multiple sources, and quality varies between published checkpoints. As with any open-weights release, the integrator owns hosting, scaling, and SLA, but the licensing cost is zero and the model can be fine-tuned on in-house audio.

Best for: Telugu-language transcription for journalism, edtech, and civic-tech projects.
Watch out for: Distributed releases; quality varies between published checkpoints.
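
Because quality varies between published checkpoints, one way to choose among them is to score each on a small set of your own transcribed audio. The sketch below is an illustration under stated assumptions, not part of any of the releases: it computes corpus-level word error rate (WER) from plain strings and ranks candidates by it. The checkpoint names and transcripts are hypothetical placeholders, and producing the hypothesis transcripts from each model is left out.

```python
import unicodedata


def _words(text):
    # Normalize to NFC so visually identical Telugu strings compare equal.
    return unicodedata.normalize("NFC", text).split()


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses differ in length")
    errors = 0
    total = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = _words(ref_text)
        errors += word_edit_distance(ref, _words(hyp_text))
        total += len(ref)
    if total == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / total


def rank_checkpoints(references, outputs_by_checkpoint):
    """Return (checkpoint name, WER) pairs, lowest WER first."""
    scores = {name: corpus_wer(references, hyps)
              for name, hyps in outputs_by_checkpoint.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Placeholder data: Latin tokens stand in for real Telugu transcripts.
references = ["a b c d", "e f"]
outputs = {
    "checkpoint_b": ["a x c", "e f g"],   # hypothetical model output
    "checkpoint_a": ["a b c d", "e f"],   # hypothetical model output
}
ranking = rank_checkpoints(references, outputs)
```

WER computed this way is sensitive to punctuation and spelling conventions, so apply the same text cleanup to references and hypotheses before comparing checkpoints.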

Install / use

Search huggingface.co for 'telugu asr' to find model cards.
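
The same search can be scripted. The following is a minimal sketch that assumes the public Hugging Face Hub model-listing endpoint (`/api/models`) and its `search`, `filter`, and `limit` query parameters. The `automatic-speech-recognition` tag filter is an assumption about how the relevant model repos are tagged, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

HUB_API = "https://huggingface.co/api/models"  # assumed public listing endpoint


def build_search_url(query, limit=10):
    """Build a Hub API URL that searches model repos for `query`."""
    params = {
        "search": query,
        "filter": "automatic-speech-recognition",  # assumed task tag
        "limit": limit,
    }
    return HUB_API + "?" + urllib.parse.urlencode(params)


def search_models(query, limit=10):
    """Return matching model IDs. Needs network access."""
    url = build_search_url(query, limit)
    with urllib.request.urlopen(url, timeout=30) as resp:
        results = json.load(resp)
    # Each result is a JSON object; the repo name is read from "id",
    # falling back to "modelId" if that key is absent.
    return [m.get("id") or m.get("modelId") for m in results]


# Example (requires network): print(search_models("telugu asr"))
```

Results still need manual review: read each model card for its training data, license, and reported error rates before relying on a checkpoint.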

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 1
HIPAA eligible: No

Telugu Speech Corpus vs Whipscribe

Feature | Telugu Speech Corpus | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 1 | 99
Platforms | Linux | Web, API, MCP

Alternatives to Telugu Speech Corpus

Whipscribe is a managed faster-whisper + whisperX service that returns transcripts without requiring you to run your own infrastructure.