Open source transcription tools
Self-hostable transcription engines and desktop apps you can run yourself, with source you can read and modify.
357 tools · updated 2026-05-15
The reference open-source multilingual ASR model from OpenAI.
C/C++ port of Whisper — runs on anything, from a Raspberry Pi to Apple Silicon.
4× faster than reference Whisper using CTranslate2 — production sweet spot.
Faster-whisper + forced alignment + speaker diarization in one pipeline.
CLI that transcribes 150 minutes of audio in ~98 seconds on an A100.
Whisper with stabilised timestamps — more accurate word-level timing.
Swift Whisper for Apple Silicon — CoreML, ANE, Metal. Now part of the Argmax Open-Source SDK (v1.0.0, May 2026) alongside SpeakerKit + TTSKit.
Distilled Whisper: 6× faster, 49% smaller, within 1% WER of the teacher.
Meta's speech-to-text + speech-to-speech + text-to-speech model, 100 languages.
Lightweight offline speech recognition for 20+ languages, runs on a Raspberry Pi.
Cross-platform desktop app for Whisper — open-source MacWhisper alternative.
Open-source TTS model with strong prosody — slow on CPU.
Open-source TTS toolkit with multi-language voice models.
70× faster Whisper on TPUs via JAX + Flax + batching.
Whisper inference on Apple Silicon via Apple's MLX framework.
Idiomatic Rust bindings for whisper.cpp.
Whisper running on Windows via DirectCompute / GPGPU.
Single-EXE Whisper for Windows + Linux, no dependencies.
Command-line Whisper using CTranslate2 — closest match to openai/whisper CLI.
Python bindings for whisper.cpp with a simple iterator API.
Gradio web UI bundling faster-whisper + diarization + translation.
Real-time Whisper transcription over WebSockets.
Ultra-low-latency speech→LLM pipeline: WhisperLive + Mistral + TensorRT.
WhisperFusion's voice-chat reference app.
Academic real-time Whisper streaming with LocalAgreement-2.
Word-level timestamps for OpenAI Whisper without retraining.
Whisper + NeMo MSDD diarization pipeline.
OpenAI-compatible /v1/audio/transcriptions endpoint over faster-whisper.
Dockerized Whisper REST API with multiple backends.
Optimized batched Whisper engine with VAD + dynamic batching.
Mic-in-browser → real-time Whisper transcription demo.
Always-listening hot-mic Whisper transcriber.
Single-page web UI to generate subtitles via Whisper.
Whisper inverted into a TTS — also used as an ASR-aware training-data tool.
Easy-to-use speech toolkit: TTS, STT, alignment, language detection.
One-click Whisper + diarization + voice cloning Gradio app.
Whisper retrained for medical / clinical transcription accuracy.
Toolkit + model zoo behind Canary, Parakeet, Conformer, FastConformer.
Meta's multilingual speech-translation + transcription foundation suite.
Meta's seq-to-seq toolkit — home of wav2vec, HuBERT, XLS-R, MMS.
One API for Whisper, Wav2Vec2, HuBERT, XLS-R, SeamlessM4T, Parakeet.
ONNX + TensorRT + OpenVINO acceleration for Transformers ASR models.
Multi-GPU / mixed-precision launcher for any PyTorch ASR training script.
Streaming loader for Common Voice, LibriSpeech, GigaSpeech, FLEURS.
LoRA / adapters for parameter-efficient Whisper fine-tuning.
PyTorch toolkit for ASR, speaker, diarization, enhancement.
End-to-end speech toolkit: ASR, TTS, ST, speaker, separation.
The classic C++ HMM/DNN speech recognition toolkit.
FSA/FST framework written from scratch in PyTorch/CUDA.
ASR recipes (Conformer / Zipformer / Pruned Transducer) for k2 + sherpa.
Production server for k2/icefall + Whisper models (PyTorch).
ONNX-runtime ASR: Whisper, Zipformer, Paraformer on every platform.
ASR on NCNN — Android-friendly, CPU-only, no FP support needed.
Successor to Mozilla DeepSpeech, maintained by Coqui.
The original open RNN + CTC speech engine from Mozilla — archived but historically important.
Production-first E2E ASR — U2++ Conformer, streaming + offline.
End-to-end speech recognition toolkit by ATHENA-OPEN-SOURCE.
Lightweight CMU Sphinx engine for embedded keyword spotting.
The classic Java speech engine from CMU.
Baidu's all-in-one speech toolkit on PaddlePaddle.
The DeepSpeech-style recipes inside PaddleSpeech.
Lightweight Japanese-focused open ASR with WFST decoding.
Word-level alignment via Kaldi for 100+ languages.
NVIDIA's TF1 framework — historical home of Jasper + QuartzNet.
RWTH's flexible neural network training framework for ASR research.
Alibaba DAMO's Paraformer / SenseVoice / Whisper toolkit.
Rev's open WFST-decoded ASR + diarization stack.
Rev.com's WER + alignment scoring tool over WFSTs.
Baidu's TTS half — included for end-to-end voice pipelines.
The reference open diarization + speaker embedding toolkit.
Hervé Bredin's personal mirror of pyannote.audio.
Tiny, accurate voice-activity-detection model — runs on CPU.
Python bindings for Google's WebRTC VAD.
Streaming speaker diarization on top of pyannote.
A minimal pyannote / SpeechBrain diarization wrapper.
Speaker-verification embeddings from a small generalist encoder.
Tiny English ASR optimized for resource-constrained devices.
Mirror of Useful Sensors' Moonshine releases.
Run Whisper / wav2vec2 entirely in the browser via ONNX Runtime Web.
Original transformers.js repo by Joshua Lochner (pre-merge into HF).
Apple's array framework — runs Whisper, Phi, Llama on Apple Silicon.
Swift bindings for MLX — embed Whisper in iOS/macOS apps.
Reference Swift apps for MLX, including Whisper.
Python MLX examples — Whisper, Llama, Stable Diffusion.
Audio + image data loaders for MLX training.
Minimalist Rust ML framework with Whisper support.
The tensor library underneath llama.cpp + whisper.cpp.
The new ggml-org home of whisper.cpp.
GGUF runtime — runs many ASR forks (whisper, parakeet, qwen-audio).
Microsoft's cross-platform inference runtime for ONNX-exported Whisper.
Open exchange format used by every ASR optimizer.
Intel's CPU/iGPU/NPU inference toolkit — Whisper-tuned.
Reference notebooks including Whisper + SeamlessM4T export.
BF16/AMX speedups for Whisper PyTorch inference on Intel CPUs.
High-throughput inference engine — supports Whisper / Llava / Qwen-Audio.
Structured generation runtime — supports Qwen-Audio / Phi-Multimodal.
NVIDIA's optimized inference for Whisper, Canary, Parakeet on Triton.
Legacy NVIDIA inference engine — predecessor to TensorRT-LLM.
Production inference server — runs audio-multimodal LLMs.
TensorFlow 2 end-to-end ASR — Conformer, ContextNet, DeepSpeech2.
Streaming Conformer + DeepSpeech2 in PyTorch for Mandarin.
Companion audio-classification training repo for MASR.
Multi-backend Python speech-recognition library.
Production-style speaker embedding + verification toolkit.
Open speech-separation toolkit aligned with WeNet ASR.
Meta's C++ ML library — home of wav2letter.
Standalone CTC / sequence decoders from Flashlight.
Meta's original fast convolutional ASR system.
Reference PyTorch implementation of the Conformer architecture.
Reference Speech-Transformer in PyTorch.
Home of WavLM, HuBERT++, Speech-T5, BEATs, VALL-E.
Unified speech-text Transformer (ASR + TTS + VC).
Post-processing for ASR: numbers, dates, units in 20+ languages.
Open clients for Riva — NVIDIA's commercial ASR/TTS server.
Reference SDK — covers the Whisper + Realtime audio endpoints.
Open conversational-AI stack with self-hosted ASR + NLP.
Production transcription microservice powering the LinTO stack.
Modular successor to fairseq used by Seamless models.
Eval harness — includes WER evaluations for ASR.
Free open course on audio ML, including Whisper fine-tuning.
Open ASR leaderboard (LibriSpeech, GigaSpeech, AISHELL).
Simultaneous speech-to-speech translation with streaming ASR.
Open TTS — relevant when pairing ASR with read-back TTS.
OSS 'second brain' that ingests transcripts via Whisper.
Unified audio foundation model (Codec + LM) — handles ASR.
Distributed Whisper / Conformer training at scale.
The deepspeedai-org home of DeepSpeed.
Microsoft's inference-side companion to DeepSpeed.
Model-optimization toolchain — Whisper ONNX/QNN/DirectML targets.
Underlying framework for whisper-jax and TPU ASR research.
JAX's new home under the JAX-ML org.
JAX neural-net library used by whisper-jax.
Subword tokenizer used by Whisper, SeamlessM4T, Canary.
The framework underlying TensorFlowASR + many older recipes.
Text ops + tokenizers integrated with TF ASR pipelines.
Google's research-grade TF framework — original Conformer code.
Tensor-parallel training — used for Speech-LLM scaling.
Mixed-precision / fused ops library used in NeMo training.
C++ / Python API for cuDNN — speeds up custom ASR kernels.
High-performance CUDA matrix kernels used by Whisper engines.
Open distributed framework — supports Whisper LoRA fine-tunes.
Compile + deploy LLMs (and Whisper) to phones / browsers / WebGPU.
Track / serve Whisper experiments and model registry.
Eval harness now covering audio-LLM benchmarks.
Fast neural TTS for Home Assistant — pairs with Whisper.
Mozilla's archived TTS — historical reference.
Reference Tacotron2 + WaveGlow stack from NVIDIA.
Flow-based vocoder companion to Tacotron2.
Multispeaker prosody TTS — historical NVIDIA release.
Multilingual TTS toolkit from Stuttgart IMS.
Alternate-case mirror of IMS Toucan.
Reference E2E TTS — building block for voice-agent loops.
Flow-based parallel TTS reference.
Transformer-based generative audio / TTS.
Conversational TTS — voice agent companion to Whisper.
Open zero-shot voice cloning + TTS.
VITS2 + BERT prosody TTS — companion to Whisper.
Style-conditioned TTS — pairs with Whisper for narration apps.
Original StyleTTS — predecessor of StyleTTS2.
Flow-matching TTS — open and fast.
Open zero-shot voice cloning TTS.
Few-shot voice cloning — companion to Whisper-cloned datasets.
Real-Time Voice Cloning interface — pairs with Whisper alignment.
Meta's audio-generation stack (MusicGen, AudioGen, EnCodec).
Neural audio codec — used by SeamlessM4T + many speech-LMs.
Mirror of facebookresearch/encodec.
Speech-without-text framework from Meta.
Masked-Autoencoder pretrain for audio — feeds downstream ASR.
High-quality neural audio codec — alternative to EnCodec.
Audio data tooling library that pairs with DAC.
Open robotics — includes spoken-command ASR demos.
Generative-audio diffusion — paired with Whisper for content pipelines.
Few-shot text classifier — useful for post-transcript tagging.
RAG over PDFs / transcripts — downstream ASR consumer pattern.
Training framework for large speech-LMs.
Open CLIP — companion vision encoder in multimodal ASR research.
Code-generation T5 — used in voice-coding agents on top of Whisper.
Conditional-LM — historical companion to speech-text research.
Safety layer often paired with Whisper voice agents.
Sequence models — companion to spoken-search recommender pipelines.
Alibaba's patched Megatron — used for Paraformer scale-up.
Compression library used by ASR data pipelines.
Catch-all for Google ASR papers (USM, BigSSL, Conformer).
Historical TF1 seq2seq — early Listen-Attend-Spell era.
Andrej Karpathy's bare-metal C training code — reference for compact ASR.
Face restoration — often paired with Whisper subtitle pipelines.
Stable-Diffusion animation — used with Whisper subs in content pipelines.
Voice-coding agent example over Whisper.
Spherical signal transforms — used in advanced ASR research.
Capitalized-name mirror of whisper-ctranslate2.
Pre-SYSTRAN home of faster-whisper.
The ggml-org-hosted mirror of llama.cpp.
Capitalized-name mirror of wenet.
Free browser-based manual transcription tool — keyboard-shortcut transcript editor.
Open dictation engines used by the Talon Voice community.
Open-source desktop transcription and dictation app built on Whisper.
Open-source Indic ASR models from IIT Madras' AI4Bharat lab — 22 scheduled Indian languages.
Mozilla Common Voice — public-domain multilingual speech corpus that powers many regional STT models.
Meta Massively Multilingual Speech — open-source ASR for 1,100+ languages.
Israeli national Hebrew ASR — research models from the Israeli AI consortium.
AI4D Africa — multilingual African speech datasets and ASR baselines.
Khipu community — open-source Andean Spanish, Quechua, and Aymara speech research.
VinAI Research — Vietnamese-language ASR and speech research from the Vingroup AI arm.
Khmer-language speech recognition research for the Cambodian market.
Typhoon — Thai-language LLM and ASR initiative from SCB 10X.
Mesolitica — Bahasa Malaysia and Bahasa Indonesia speech research checkpoints.
Tbilisi State University Georgian speech recognition research.
YerevaNN research lab Armenian speech recognition checkpoints.
Open-source Turkish-language ASR checkpoints from Turkish university labs.
Kencorpus / Maseno — Kenyan Swahili and English code-switch speech dataset and baselines.
IIIT-Hyderabad speech lab — academic Indian-language ASR datasets and checkpoints.
IIT Madras speech group — academic Indian-language ASR research and AI4Bharat home.
IIT Bombay speech group — Indian-language ASR research and Bhashini contributions.
Akylai project — Kyrgyz-language voice assistant and ASR research.
Institute of Smart Systems and AI (Nazarbayev University) — Kazakh-language ASR research.
Open Telugu-language speech corpora and models for SE-Indian transcription.
Community-published Tamil-language ASR models and corpora.
Bengali-language ASR datasets and models from the BNLP / Bengali NLP community.
L3Cube Pune — Marathi-language NLP and speech research releases.
Kungliga Biblioteket (National Library of Sweden) Whisper fine-tunes for Swedish.
Norwegian National Library Whisper fine-tunes for Bokmål and Nynorsk.
Chinese University of Hong Kong — Cantonese speech research and open checkpoints.
Open-source framework for voice and multimodal conversational AI agents.
Open-source framework for building realtime AI voice agents on LiveKit's WebRTC stack.
Open-source conversational AI framework with voice channel integration.
Open-core conversational AI platform with voice channels.
Open-source desktop client routing voice to LLM voice agents.
Open-source privacy-respecting voice assistant for home automation.
Open-source framework by Agora for building realtime multimodal voice AI agents.
Kyutai's open speech-to-speech foundation model and demo voice agent.
Open-source Python library for building real-time voice-LLM applications.
Open-weights multilingual voice cloning from 6 seconds of audio — 17 languages.
Performance fork of Tortoise — quality kept, latency 5-10× lower.
Self-hosted open-source MOOC platform with caption-track support.
1000h read English audiobook corpus — the canonical ASR benchmark since 2015.
60k hours of unlabeled English audiobook audio for self-supervised pretraining.
Crowd-sourced multilingual speech corpus — 30k+ hours across 130 languages.
452h of TED talk audio + transcripts — the canonical lecture-style ASR benchmark.
400k hours of European Parliament speeches in 23 EU languages.
44.5k hours of read multilingual audiobook speech across 8 European languages.
TED-based English→X speech translation corpus across 14 target languages.
Common Voice-based speech-translation corpus — 21 X→en + 15 en→X language pairs.
Few-shot multilingual evaluation across 102 languages — n-way parallel speech.
Multilingual SUPERB — 143 languages × multiple tasks for self-supervised speech models.
Speech processing Universal PERformance Benchmark — 10 English speech tasks.
10,000h English ASR corpus — audiobook + podcast + YouTube blend, multiple subsets.
30,000h multilingual evolution of GigaSpeech — Thai, Indonesian, Vietnamese launch.
30,000h CC-BY-licensed English ASR corpus — Internet-Archive sourced.
500kh of YouTube speech across 100+ languages with CC-licensed subtitles.
Refresh of YODAS with long-form audio + per-language sharding — 422k hours.
5000h of professionally-transcribed earnings-call audio — financial-domain ASR.
125h earnings-call ASR test set with 27-accent speaker coverage.
100h multi-microphone meeting recordings with diarization + speaker labels.
72h research-meeting recordings — diarization and meeting-ASR alternative to AMI.
Real-world dinner-party recordings — far-field ASR + diarization in noise.
Distant-mic ASR challenge — multi-channel meeting transcription frontier.
100k utterances of celebrity speech from YouTube — speaker recognition benchmark.
1M utterances of celebrity speech — scaled-up speaker recognition corpus.
50h audio-visual diarization corpus — wild YouTube speakers in conversation.
Hard diarization-in-the-wild challenge — 11 domains from courtrooms to maps.
260h of conversational US English telephone speech — historical ASR benchmark.
2000h of telephone conversations — scaled-up successor to Switchboard.
60h of unscripted home-telephone conversations — diarization + ASR benchmark.
80h of read newspaper sentences — foundational read-speech ASR corpus from 1992.
Phonetically-balanced 5h read-speech corpus from 1986 — phoneme recognition benchmark.
178h Mandarin read-speech corpus — open Chinese ASR baseline.
1000h Mandarin read-speech corpus — scaled-up successor.
120h Mandarin meeting corpus — multi-speaker conference-room scenarios.
1000h Korean spontaneous-speech corpus — the open KR ASR baseline.
Japanese ASR corpus — 35k hours of TV recordings with captions.
Japanese-speech-from-YouTube corpus — open ASR scaling beyond Reazon.
30h Japanese versatile multi-speaker corpus — TTS + speaker-modeling baseline.
44h multi-speaker English corpus — 109 speakers across global accents for TTS.
24h single-speaker English audiobook corpus — the canonical TTS baseline.
12h dyadic emotional speech corpus — the gold-standard SER benchmark.
Audio-visual emotional speech + song corpus — open SER benchmark.
Multimodal emotion corpus from Friends TV show — conversational emotion recognition.
7442 audio-visual emotional speech clips from 91 actors — open SER corpus.
109h corpus of music + speech + noise — augmentation backbone for ASR/SV.
Room impulse responses + isotropic noises — reverberation augmentation set.
Open Speech and Language Resources — the index of 130+ free speech corpora.
6.6kh language-identification corpus — 107 languages from YouTube.
30h spoken-language-understanding corpus — intent classification benchmark.
1s keyword-spotting corpus — 35 single-word commands, ~100k utterances.
Long-form Wikipedia audiobook recordings in English / German / Dutch — ~1000h.
BBC broadcast-media ASR + diarization challenge — multi-year evaluation series.
Long-form podcast ASR + speaker-role corpus.
100k hours of English podcasts with metadata — TREC podcast evaluation corpus.
Multilingual conversational SLU dataset — 6 languages with disfluencies + code-switching.
5kh weakly-labeled multilingual TTS corpus from YouTube — 50 languages.
Toy 60-utterance Hebrew corpus — the Kaldi 'hello world' dataset.
16kh Indic-language ASR corpus across 22 Indian languages.
1684h read-speech ASR benchmark across 12 Indian languages.
Indic-language version of SUPERB — 12 languages × 6 speech tasks.
6457h Indic-language ASR corpus from All India Radio news broadcasts.
20kh Russian ASR corpus — the largest open Russian-language speech dataset.
Crowd-sourced multilingual read-speech corpus — the open-source pre-Common-Voice corpus.
15h Vietnamese read-speech ASR corpus — the open Vietnamese ASR baseline.
36h Thai emotional-speech corpus — the open Thai SER + ASR baseline.
HuggingFace ASR leaderboard — public WER + RTFx across 8 English test sets.
Aggregated ASR leaderboards across 100+ benchmarks + papers + code.
Korean government open-data hub for speech + NLP corpora — 30+ speech datasets.
NIST Speaker Recognition Evaluation — the canonical SV/SD benchmark series.
Open Speech Analytic Technologies — noise-robust ASR + KWS + SAD challenge.
Speech-translation corpus from European Parliament across 9 languages.
Low-resource multilingual ASR + KWS corpora — 25+ languages from telephony.
Johns Hopkins Center for Language and Speech Processing — Kaldi + LibriSpeech + Sherpa origins.
Brno University of Technology speech group — DIHARD + x-vector + WeSpeaker origins.
Centre for Speech Technology Research — VCTK + Merlin TTS + Festival origins.
Carnegie Mellon Language Technologies Institute — Sphinx + ESPnet + YODAS origins.
MIT Spoken Language Systems Group — TIMIT + Galaxy + Jupiter origins.
National Taiwan University Speech Lab — S3PRL + SUPERB origins.
Meta AI speech research — wav2vec 2.0 + HuBERT + MMS + Seamless origins.
Google Research Speech — USM + Chirp + AudioPaLM + FLEURS origins.
NVIDIA Speech Research — NeMo + Canary + Parakeet + Riva origins.
IIT Madras Indic AI lab — IndicVoices + Kathbath + IndicSUPERB + IndicWav2Vec.
Inria Nancy speech research team — diarization + speech enhancement leaders.
French national speech-tech lab — TC-STAR + Quaero + ELRA-LDC origins.
RWTH Aachen i6 group — RASR toolkit + IWSLT speech translation history.
International Computer Science Institute — ICSI Meeting Corpus + Aurora origins.
Mitsubishi Electric Research Labs Speech Group — CHiME + speech-enhancement leaders.
MLCommons Speech working group — People's Speech + MLPerf speech benchmarks.
International Workshop on Spoken Language Translation — annual ST evaluation.
Hub of 5000+ audio + speech datasets — the modern catalog after OpenSLR.
Open-source multilingual TTS with zero-shot voice cloning.
Open-source generative audio model from Suno — speech, music, and sound effects.
Open-source neural TTS with strong prosody and voice cloning.
MyShell's open-source voice cloning with tone-color extraction.
High-quality multi-lingual TTS from MyShell — fast and CPU-friendly.
End-to-end TTS with adversarial training — the open-source workhorse.
Non-autoregressive TTS reference implementation — fast and parallelizable.
ESPnet's TTS recipes — multi-architecture, multi-language.
Mycroft's neural TTS — designed for Raspberry Pi voice assistants.
Rhasspy's predecessor TTS — Tacotron-style models for offline assistants.
Fast, on-device neural TTS optimized for Raspberry Pi 4.
Classic Edinburgh / CMU concatenative TTS — academic reference.
Compact open-source TTS for 100+ languages — the embedded workhorse.
Java-based open-source TTS platform — research and academic deployments.
Diphone-based TTS engine — paired with eSpeak NG for more natural output.
Google's seminal end-to-end TTS architecture — the neural-TTS starting point.
Diffusion-probabilistic TTS reference implementation.
NVIDIA's parallel TTS architecture with explicit pitch control.
Lightweight 82M-param open-source TTS — Apache-2.0, runs on a Raspberry Pi.
Resemble AI's open-source emotion-aware TTS — community-licensed.
DeepMind's seminal 2016 neural-vocoder paper — historical reference only.
GAN-based neural vocoder reference — fast and high-quality.
Camb.ai's open-source MARS5 multilingual TTS reference.
Open-source toolkit for audio, music, and speech generation.
Bilibili's open-source TTS — Chinese + English bilingual.
Open-source voice assistant — community-forked after the original company wound down.
Community continuation of Mycroft — modular open-source voice assistant for Linux + Pi.
Fully offline voice assistant for Home Assistant — runs on a Raspberry Pi with no cloud.
Home Assistant's first-party voice surface — Rhasspy's successor, integrated into HA core.
Open-source personal assistant — self-hostable, privacy-respecting, modular skills.
Open-source DIY captioning glasses powered by Whisper — community hardware project.
Open-source wake-word engine — community alternative to Porcupine and Snips.
Legacy customizable wake-word engine — community-maintained after KITT.AI shutdown.