Kencorpus Swahili ASR

by Kencorpus consortium

Kencorpus / Maseno — Kenyan Swahili and English code-switch speech dataset and baselines.

TL;DR

Best for Swahili and Swahili-English code-switching transcription in East African contexts. Pricing: free.

Category: Open source
Pricing: free
Platforms: Web

What it is

Kencorpus is a Kenyan academic-led initiative that published a multilingual Swahili-English-Dholuo speech and text corpus, along with baseline ASR models. The resource is foundational for East African transcription projects, particularly those that need to handle Swahili-English code-switching, which is the dominant register in Nairobi and other urban markets. It is the best fit when the buyer needs Swahili and Swahili-English code-switching transcription in East African contexts. The honest caveat: it is primarily a dataset, and productisation is the integrator's responsibility. As with any open-weights release, the integrator owns hosting, scaling, and SLA, but the licensing cost is zero and the model can be fine-tuned on in-house audio.

Best for: Swahili and Swahili-English code-switching transcription in East African contexts.
Watch out for: Primarily a dataset; productisation is the integrator's responsibility.

Install / use

Search huggingface.co for 'kencorpus' to find the dataset and model cards.
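The same search can be scripted. The sketch below is a minimal example, assuming the public Hugging Face Hub REST endpoints at `https://huggingface.co/api/datasets` and `https://huggingface.co/api/models` with a `search` query parameter; the helper names `search_url` and `search_hub` are hypothetical and not part of any Kencorpus tooling. Which repositories come back depends on what is published on the Hub at the time of the query.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public Hugging Face Hub REST API.
HUB_API = "https://huggingface.co/api"


def search_url(kind: str, query: str, limit: int = 20) -> str:
    """Build a Hub search URL; kind must be 'datasets' or 'models'."""
    if kind not in ("datasets", "models"):
        raise ValueError(f"unsupported kind: {kind!r}")
    params = urlencode({"search": query, "limit": limit})
    return f"{HUB_API}/{kind}?{params}"


def search_hub(kind: str, query: str, limit: int = 20) -> list[str]:
    """Return the ids of matching repositories (requires network access)."""
    with urlopen(search_url(kind, query, limit), timeout=30) as resp:
        # The endpoint is assumed to return a JSON list of objects with an "id" field.
        return [item["id"] for item in json.load(resp)]
```

Calling `search_hub("datasets", "kencorpus")` and `search_hub("models", "kencorpus")` would list candidate repository ids; each card should still be read for its license and language coverage before use.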

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: None
HIPAA eligible: No

Kencorpus Swahili ASR vs Whipscribe

Feature              Kencorpus Swahili ASR   Whipscribe
Category             Open source             Transcription APIs
Pricing              free                    free beta
Speaker diarization  No                      Yes
Word timestamps      No                      Yes
Streaming            No                      No
Languages            None                    99
Platforms            Web                     Web, API, MCP

Alternatives to Kencorpus Swahili ASR

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.