stable-ts
Word-perfect timestamps for SRT/VTT subtitles on top of any Whisper backend — vanilla Whisper, faster-whisper, HF Transformers, or MLX.
MIT-licensed Python library + CLI that wraps OpenAI Whisper, faster-whisper, Hugging Face Transformers, and MLX Whisper to produce stabilised word-level timestamps. Re-runs the decode with a tuned segmentation strategy, then applies silence suppression (Silero VAD or custom denoiser), refinement (token-probability re-scoring), and regrouping rules to land cleaner SRT/VTT/ASS timing than the raw Whisper output. Latest stable: 2.19.1 on PyPI, Python ≥3.8.
Best for subtitle creators, video editors, and accessibility teams who need karaoke-quality word timing, force-alignment of existing transcripts to fresh audio, or both. Pair with faster-whisper for the throughput sweet spot or with MLX on Apple Silicon. License: MIT.
What it is
stable-ts tackles Whisper's weakest point, word-level timestamp drift, by re-running inference with a tuned segmentation strategy. The result is noticeably cleaner SRT/VTT output for subtitles and shorts. It pairs well with faster-whisper or MLX Whisper as the underlying engine.
Watch out for: slower than running faster-whisper alone, because the stabilisation passes add work on top of the decode; no speaker diarization; less community momentum than WhisperX.
Install / use
pip install stable-ts
Where stable-ts shines · use-case cards
stable-ts solves a small set of problems very well. Each card below maps a concrete subtitle / alignment workflow to the right flag or API entry point. Links point to the canonical section of the project README.
The default reason to reach for stable-ts. One command produces a polished SRT/VTT/ASS file with stabilised segment and word timing, typically with noticeably less timestamp drift than vanilla Whisper at the same WER.
CLI: stable-ts audio.mp3 -o audio.srt · also -o .vtt / .ass / .tsv / .json
Per-word start/end timestamps are on by default. Export to ASS for karaoke-style word highlights in the same pass, or pull the JSON output and drive your own caption renderer.
Methods: result.to_ass(...) · result.to_srt_vtt(word_level=True) · result.to_dict()
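For the custom-renderer route, here is a minimal sketch of turning word timings into one-cue-per-word SRT text. It assumes the dictionary returned by result.to_dict() has a "segments" list whose items carry a "words" list with "word", "start", and "end" keys; that shape and the helper names format_srt_time and words_to_srt are illustrative assumptions, so check them against your installed version.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(result_dict: dict) -> str:
    """Build SRT text with one cue per word (assumed dict shape, see lead-in)."""
    cues = []
    index = 1
    for segment in result_dict.get("segments", []):
        for word in segment.get("words", []):
            text = word["word"].strip()
            if not text:
                continue  # skip empty tokens
            cues.append(
                f"{index}\n"
                f"{format_srt_time(word['start'])} --> "
                f"{format_srt_time(word['end'])}\n"
                f"{text}\n"
            )
            index += 1
    # Each cue already ends with a newline; joining adds the blank separator line.
    return "\n".join(cues)


# Hand-written sample standing in for result.to_dict()
sample = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5},
                {"word": " world", "start": 0.5, "end": 1.25},
            ]
        }
    ]
}
srt_text = words_to_srt(sample)
```

For plain SRT/VTT export the built-in result.to_srt_vtt(...) is the simpler path; a hand-rolled renderer like this only makes sense when the caption format is your own.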
Have a clean human transcript but no timing? align() pins the words to the audio without re-running full transcription — much faster than transcribing and ideal for iterating on segmentation rules. Realign a previous result the same way.
Python: model.align(audio, text, language='en') · also align_words(...) for fast in-segment word timing
Same stabilisation pipeline on top of any backend. Default is vanilla Whisper; switch with a single CLI flag or by loading a different model class in Python. Pick by hardware: faster-whisper on NVIDIA, MLX on Apple Silicon, HF Transformers for non-standard checkpoints.
CLI flags: -fw faster-whisper · -hw Hugging Face · -mlx MLX Whisper
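A sketch of how the pick-by-hardware advice could look in Python. The loader names are the ones this page lists for the stable_whisper package; the backend labels, pick_backend, and load_backend are hypothetical helpers, and passing only a model name to each loader is an assumption to verify against the README.

```python
import importlib

# Backend label -> loader function name in the stable_whisper package.
# Names are taken from this page; confirm them for your installed version.
BACKEND_LOADERS = {
    "whisper": "load_model",
    "faster-whisper": "load_faster_whisper",
    "hf": "load_hf_whisper",
    "mlx": "load_mlx_whisper",
}


def pick_backend(has_cuda: bool, is_apple_silicon: bool) -> str:
    """Hypothetical rule of thumb mirroring the hardware guidance above."""
    if has_cuda:
        return "faster-whisper"
    if is_apple_silicon:
        return "mlx"
    return "whisper"


def load_backend(backend: str, model_name: str = "base"):
    """Load a model through the chosen backend's loader."""
    if backend not in BACKEND_LOADERS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKEND_LOADERS)}"
        )
    # Imported lazily so this module can be imported without stable-ts installed.
    stable_whisper = importlib.import_module("stable_whisper")
    loader = getattr(stable_whisper, BACKEND_LOADERS[backend])
    return loader(model_name)
```

Usage would be along the lines of model = load_backend(pick_backend(has_cuda=True, is_apple_silicon=False), "large-v3"), after which the same transcribe, align, and refine calls apply.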
Whisper's biggest timestamp failure mode is hallucinated text over silence and word boundaries that snap to silence edges. Enable VAD to drop silent regions before decode, and suppress_silence to snap timestamps to true voiced regions afterward.
Flags: --vad (Silero) · --suppress_silence · --denoiser · --q_levels · --k_size
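To make the post-decode snapping concrete, here is a toy sketch of trimming a word's span so it neither starts nor ends inside a silent interval. This is not stable-ts's implementation; the function name snap_to_voiced, the min_duration parameter, and the idea of passing silences as (start, end) pairs are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def snap_to_voiced(
    start: float,
    end: float,
    silences: List[Interval],
    min_duration: float = 0.02,
) -> Interval:
    """Trim a word's span so it does not begin or end inside a silent interval."""
    for s_start, s_end in sorted(silences):
        # Word begins inside silence: move the start to where the silence ends.
        if s_start <= start < s_end:
            start = min(s_end, end)
        # Word ends inside silence: move the end back to where the silence begins.
        if s_start < end <= s_end:
            end = max(s_start, start)
    # A word lying wholly inside silence collapses; keep a minimal span instead.
    if end - start < min_duration:
        end = start + min_duration
    return (start, end)


# A word whose head and tail both overlap silence
snapped = snap_to_voiced(1.0, 2.0, [(0.0, 1.3), (1.8, 2.5)])
```

In this sample the word originally spans 1.0 to 2.0 seconds; the leading silence pushes its start to 1.3 and the trailing silence pulls its end back to 1.8.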
Refinement iteratively mutes spans of audio to find the true word boundaries via token-probability shifts — slower than a single decode but the closest thing to ground-truth timing without forced alignment. locate() searches an audio file for when specific words are spoken without full transcription.
Python: model.refine(audio, result) · model.locate(audio, 'word', language='en')
Backends are selected on the CLI with a single flag (-fw / -hw / -mlx); the Python API exposes stable_whisper.load_model, load_faster_whisper, load_hf_whisper, and load_mlx_whisper. Pick whichever matches your hardware and call the same .transcribe() / .align() / .refine() methods on the returned model. For raw throughput see faster-whisper; for diarization on top of word timing see WhisperX.
Setup recipes · pick one and copy
Three runnable configurations covering the most common stable-ts deployments. Each block is copy-and-run against stable-ts 2.19.x.
pip install -U stable-ts · CLI · one-line SRT export. The default pick.
# stable-ts 2.19.x · Python >=3.8
pip install -U stable-ts
# Transcribe a video / audio file and write SRT in one command.
stable-ts input.mp4 -o out.srt
# Other output formats:
# stable-ts input.mp4 -o out.vtt
# stable-ts input.mp4 -o out.ass # karaoke-friendly word highlights
# stable-ts input.mp4 -o out.json # raw timings for custom renderers
# Pick a Whisper model size with --model (default: base)
stable-ts input.mp4 --model large-v3 -o out.srt
Add the [fw] extra, pass -fw, point at large-v3 on a GPU. Production sweet spot when you also want stable-ts post-processing.
# stable-ts + faster-whisper backend
pip install -U "stable-ts[fw]"
# CLI: -fw routes the decode through faster-whisper (CTranslate2).
stable-ts input.mp3 \
  -fw \
  --model large-v3 \
  --vad true \
  --suppress_silence true \
  -o out.srt
# Python equivalent
import stable_whisper
model = stable_whisper.load_faster_whisper("large-v3")
result = model.transcribe(
    "input.mp3",
    vad=True,
    suppress_silence=True,
    word_timestamps=True,
)
result.to_srt_vtt("out.srt", word_level=False)
Use model.align() when you already have clean text and just need timestamps. Significantly faster than re-transcribing.
# Force-align plain text to audio at word level.
import stable_whisper
model = stable_whisper.load_model("base")
text = (
    "Machines thinking, breeding. "
    "You were to bear us a new, promised land."
)
result = model.align("audio.mp3", text, language="en")
# Inspect or export
for seg in result.segments:
    for w in seg.words:
        print(f"{w.start:6.2f}-{w.end:6.2f} {w.word!r}")
result.to_srt_vtt("out.srt", word_level=True) # word-highlight subtitles
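Before exporting word-level subtitles it can help to sanity-check the aligned timings. The sketch below works on plain (start, end, word) tuples like the values printed by the loop above; find_timing_issues and its min_duration threshold are hypothetical and not part of stable-ts.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]


def find_timing_issues(words: List[Word], min_duration: float = 0.05) -> List[str]:
    """Report words that are suspiciously short or that overlap an earlier word."""
    issues = []
    prev_end = 0.0
    for start, end, text in words:
        if end - start < min_duration:
            issues.append(f"{text!r}: too short ({end - start:.2f}s)")
        if start < prev_end:
            issues.append(f"{text!r}: starts before previous word ends")
        prev_end = max(prev_end, end)
    return issues


# Hand-written sample timings; the second word is too short, the third overlaps.
sample_words = [
    (0.00, 0.40, " Machines"),
    (0.40, 0.41, " thinking,"),
    (0.35, 0.90, " breeding."),
]
issues = find_timing_issues(sample_words)
```

With a real alignment result, the input list could be built as [(w.start, w.end, w.word) for seg in result.segments for w in seg.words].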
Features
| Feature | Value |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | Yes |
| Streaming / real-time | No |
| Languages supported | 99 |
| HIPAA eligible | No |
Links
- jianfch/stable-ts: main repo · README is the canonical reference for CLI flags, output methods, and the alignment / refinement / locating APIs
- PyPI · stable-ts 2.19.1: latest stable, released 2025-08-16 · Python ≥3.8 · install with pip install -U stable-ts
- README · Alignment: force-align plain text or an existing result to audio at word level via model.align() and align_words()
- README · Refinement: iterative timestamp refinement by muting audio spans and re-scoring token probabilities, the closest thing to ground-truth word boundaries
- README · Silence suppression: the --vad, --suppress_silence, --denoiser, --q_levels, --k_size flags that drop hallucinated text over silence and snap word boundaries to voiced regions
- README · Regrouping words: chain regrouping operations (split_by_punctuation, merge_by_gap, ...) to shape subtitle segment boundaries before export
- README · Locating words: model.locate() searches an audio file for when specific words are spoken without running a full transcription
- openai/whisper: the reference Whisper implementation that stable-ts wraps by default
- SYSTRAN/faster-whisper: the CTranslate2 backend selected via the -fw CLI flag or load_faster_whisper(...) · ~4× faster decode at the same WER
- mlx-examples · Whisper: the Apple Silicon backend selected via the -mlx CLI flag; runs Whisper natively on M-series GPUs
- snakers4/silero-vad: the voice-activity-detection model stable-ts uses when --vad is enabled
stable-ts vs Whipscribe
| Feature | stable-ts | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | Yes | Yes |
| Streaming | No | No |
| Languages | 99 | 99 |
| Platforms | Linux, macOS, Windows, GPU | Web, API, MCP |
Alternatives to stable-ts
Whipscribe is a managed faster-whisper + WhisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file and you'll have a transcript in seconds.