ESPnet TTS

by ESPnet

ESPnet's TTS recipes — multi-architecture, multi-language.

TL;DR

Best for researchers benchmarking TTS architectures (Tacotron2 / FastSpeech / VITS / etc.) on a common harness. Pricing: free (Apache-2.0).

Category: Open source
License: Apache-2.0
Pricing: free (Apache-2.0)
Platforms: Linux

What it is

ESPnet is an end-to-end speech-processing framework whose TTS sub-tree covers Tacotron 2, Transformer-TTS, FastSpeech, FastSpeech 2, VITS, JETS, and more. Pretrained checkpoints are available in 30+ languages, and the code is licensed under Apache-2.0. Consent posture: synthetic-only by default.

Best for: Researchers benchmarking TTS architectures (Tacotron2 / FastSpeech / VITS / etc.) on a common harness.
Watch out for: Heavy framework — overkill for pure inference; better suited to training.

Install / use

pip install espnet
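
Once the package is installed, inference with a pretrained checkpoint might look like the sketch below. It is a minimal example under stated assumptions, not an official recipe. It assumes the espnet_model_zoo and soundfile packages are also installed, and that the model tag kan-bayashi/ljspeech_vits (an English LJSpeech VITS checkpoint) is still published. The function name synthesize and its parameters are hypothetical and exist only for illustration.

```python
from pathlib import Path


def synthesize(text: str,
               out_path: str = "out.wav",
               model_tag: str = "kan-bayashi/ljspeech_vits") -> Path:
    """Synthesize `text` to a WAV file with a pretrained ESPnet2 TTS model.

    `model_tag` is an assumed model-zoo tag; swap in whichever checkpoint
    you actually want to evaluate.
    """
    # Heavy imports stay inside the function so that importing this module
    # does not itself require ESPnet to be installed.
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    # Downloads (and caches) the checkpoint on first use.
    tts = Text2Speech.from_pretrained(model_tag)

    # The call returns a dict; "wav" holds the waveform as a torch.Tensor.
    wav = tts(text)["wav"]

    # Write 16-bit PCM at the model's native sampling rate (tts.fs).
    sf.write(out_path, wav.view(-1).cpu().numpy(), tts.fs, subtype="PCM_16")
    return Path(out_path)
```

Calling synthesize("Hello from ESPnet.") should write out.wav to the current directory. VITS is used here because it generates a waveform directly; two-stage models such as Tacotron 2 or FastSpeech 2 would typically also need a neural vocoder to reach comparable quality.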

Features

Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: No
Languages supported: 30
HIPAA eligible: No

ESPnet TTS vs Whipscribe

Feature | ESPnet TTS | Whipscribe
Category | Open source | Transcription APIs
Pricing | free (Apache-2.0) | free beta
Speaker diarization | No | Yes
Word timestamps | No | Yes
Streaming | No | No
Languages | 30 | 99
Platforms | Linux | Web, API, MCP

Alternatives to ESPnet TTS

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, you can paste a URL or upload a file and get a transcript in seconds.