stable-ts

by jianfch

Word-perfect timestamps for SRT/VTT subtitles on top of any Whisper backend — vanilla Whisper, faster-whisper, HF Transformers, or MLX.

TL;DR

MIT-licensed Python library + CLI that wraps OpenAI Whisper, faster-whisper, Hugging Face Transformers, and MLX Whisper to produce stabilised word-level timestamps. It re-runs the decode with a tuned segmentation strategy, then applies silence suppression (Silero VAD or custom denoiser), refinement (token-probability re-scoring), and regrouping rules to land cleaner SRT/VTT/ASS timing than the raw Whisper output. Latest stable: 2.19.1 on PyPI, Python ≥3.8.

Best for subtitle creators, video editors, and accessibility teams who need karaoke-quality word timing, force-alignment of existing transcripts to fresh audio, or both. Pair with faster-whisper for the throughput sweet spot or with MLX on Apple Silicon.

Category: Open source
License: MIT
Stars: ★ 2.2k
Last push: 2025-10-29
Pricing: free
Platforms: Linux, macOS, Windows, GPU

What it is

stable-ts tackles Whisper's weakest point, word-level timestamp drift, by re-running inference with a tuned segmentation strategy. The result is noticeably cleaner SRT/VTT output for subtitles and shorts. It pairs well with faster-whisper (or MLX Whisper on Apple Silicon) as the underlying engine.

Best for: Subtitle generation where drift and poor word boundaries matter.
Watch out for: Slower than running faster-whisper on its own; no diarization; less community momentum than whisperX.

Install / use

pip install stable-ts

Where stable-ts shines · use-case cards

stable-ts solves a small set of problems very well. Each card below maps a concrete subtitle / alignment workflow to the right flag or API entry point. Links point to the canonical section of the project README.

SRT / VTT for video
Output formats · subtitle-grade timing

The default reason to reach for stable-ts. One command produces a polished SRT/VTT/ASS file with stabilised segment and word timing, typically with less timestamp drift than vanilla Whisper at the same WER.


CLI: stable-ts audio.mp3 -o audio.srt · also -o .vtt / .ass / .tsv / .json

Karaoke / word-highlighted captions
Word-level timestamps for highlight rendering

Per-word start/end timestamps are on by default. Export to ASS for karaoke-style word highlights in the same pass, or pull the JSON output and drive your own caption renderer.


Methods: result.to_ass(...) · result.to_srt_vtt(word_level=True) · result.to_dict()
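For the custom-renderer route, the following is a minimal sketch of turning word timings into word-highlighted SRT cues. It assumes the exported dictionary has a `segments` list whose items carry a `words` list with `word`, `start`, and `end` keys. That shape, the sample data, and the helper names `srt_time` and `karaoke_cues` are assumptions made for illustration, not part of the stable-ts API, so inspect the real output of result.to_dict() before relying on it.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def karaoke_cues(data: dict) -> str:
    """Emit one SRT cue per word, with the active word wrapped in <b> tags."""
    cues = []
    for segment in data["segments"]:
        words = segment["words"]
        for i, active in enumerate(words):
            # Rebuild the segment text, highlighting only the active word.
            parts = [
                f"<b>{w['word'].strip()}</b>" if j == i else w["word"].strip()
                for j, w in enumerate(words)
            ]
            cues.append(
                f"{len(cues) + 1}\n"
                f"{srt_time(active['start'])} --> {srt_time(active['end'])}\n"
                f"{' '.join(parts)}\n"
            )
    return "\n".join(cues)


# Hypothetical sample in the assumed shape (not real stable-ts output).
sample = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.42},
                {"word": " world", "start": 0.5, "end": 1.0},
            ]
        }
    ]
}
srt_text = karaoke_cues(sample)
print(srt_text)
```

Each cue spans exactly one word, so gaps between words show no caption; a real renderer might extend each cue to the next word's start instead.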
Align an existing transcript
Force-align text → audio

Have a clean human transcript but no timing? align() pins the words to the audio without re-running full transcription — much faster than transcribing and ideal for iterating on segmentation rules. Realign a previous result the same way.


Python: model.align(audio, text, language='en') · also align_words(...) for fast in-segment word timing

Backend of choice
Whisper · faster-whisper · HF · MLX

Same stabilisation pipeline on top of any backend. Default is vanilla Whisper; switch with a single CLI flag or by loading a different model class in Python. Pick by hardware: faster-whisper on NVIDIA, MLX on Apple Silicon, HF Transformers for non-standard checkpoints.


CLI flags: -fw faster-whisper · -hw Hugging Face · -mlx MLX Whisper

Silence suppression / VAD
Silero VAD + denoiser preprocessing

Whisper's most common timestamp failures are hallucinated text over silence and word boundaries that spill into silent regions. Enable VAD to detect non-speech with Silero, and suppress_silence to move timestamps off the silent regions it finds so they sit on voiced audio.


Flags: --vad (Silero) · --suppress_silence · --denoiser · --q_levels · --k_size
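To make "snapping timestamps to voiced regions" concrete, here is a toy sketch of the idea on plain numbers. It is not how stable-ts implements suppress_silence. The function name `snap_to_voice` and the silence intervals are hypothetical, and the rule shown (trim a word edge only when a silent interval overlaps that edge) is a simplification chosen for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def snap_to_voice(word: Interval, silences: List[Interval]) -> Interval:
    """Trim a word's (start, end) where a silent interval overlaps one of its edges.

    Words that lie entirely inside a silent interval are left unchanged.
    """
    start, end = word
    for s_start, s_end in silences:
        # Silence overlaps the word's start: move the start to where silence ends.
        if s_start <= start < s_end < end:
            start = s_end
        # Silence overlaps the word's end: move the end to where silence begins.
        if start < s_start < end <= s_end:
            end = s_start
    return (start, end)


# Hypothetical silent intervals, in seconds.
silences = [(1.0, 2.0), (3.0, 4.0)]
print(snap_to_voice((1.5, 3.4), silences))
print(snap_to_voice((0.2, 0.8), silences))
```

The first word starts and ends inside silence, so both edges are pulled in to the voiced span between the two intervals; the second word touches no silence and is returned as-is.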
Refinement + word locating
Token re-scoring · quick word search

Refinement iteratively mutes spans of audio to find the true word boundaries via token-probability shifts — slower than a single decode but the closest thing to ground-truth timing without forced alignment. locate() searches an audio file for when specific words are spoken without full transcription.


Python: model.refine(audio, result) · model.locate(audio, 'word', language='en')

Selecting at runtime: the CLI takes the backend flag (-fw / -hw / -mlx); the Python API exposes stable_whisper.load_model, load_faster_whisper, load_hf_whisper, and load_mlx_whisper. Pick whichever matches your hardware and call the same .transcribe() / .align() / .refine() methods on the loaded model. For raw throughput see faster-whisper; for diarization on top of word timing see WhisperX.
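If the backend is chosen from configuration rather than typed by hand, a small wrapper can assemble the command line. The sketch below uses only the flags named on this page (-fw, -hw, -mlx, --model, -o). The helper `build_stable_ts_cmd` and the backend keys in `BACKEND_FLAGS` are hypothetical names introduced for illustration.

```python
import shlex
from typing import List, Optional

# Backend flags as listed above; None means the default vanilla Whisper backend.
BACKEND_FLAGS = {"whisper": None, "faster-whisper": "-fw", "hf": "-hw", "mlx": "-mlx"}


def build_stable_ts_cmd(
    audio: str,
    output: str,
    backend: str = "whisper",
    model: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the stable-ts CLI with the chosen backend flag."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    cmd = ["stable-ts", audio]
    flag = BACKEND_FLAGS[backend]
    if flag:
        cmd.append(flag)
    if model:
        cmd += ["--model", model]
    cmd += ["-o", output]
    return cmd


cmd = build_stable_ts_cmd(
    "input.mp3", "out.srt", backend="faster-whisper", model="large-v3"
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True); keeping it as a list avoids shell-quoting problems with file names that contain spaces.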

Setup recipes · pick one and copy

Three runnable configurations covering the most common stable-ts deployments. Each block is copy-and-run against stable-ts 2.19.x.

1. Install + first subtitle

pip install -U stable-ts · CLI · one-line SRT export. The default pick.

# stable-ts 2.19.x · Python >=3.8
pip install -U stable-ts

# Transcribe a video / audio file and write SRT in one command.
stable-ts input.mp4 -o out.srt

# Other output formats:
#   stable-ts input.mp4 -o out.vtt
#   stable-ts input.mp4 -o out.ass   # karaoke-friendly word highlights
#   stable-ts input.mp4 -o out.json  # raw timings for custom renderers

# Pick a Whisper model size with --model (default: base)
stable-ts input.mp4 --model large-v3 -o out.srt

2. Backend = faster-whisper (up to 4× speedup)

Add the [fw] extra, pass -fw, point at large-v3 on a GPU. Production sweet spot when you also want stable-ts post-processing.

# stable-ts + faster-whisper backend
pip install -U "stable-ts[fw]"

# CLI: -fw routes the decode through faster-whisper (CTranslate2).
stable-ts input.mp3 \
  -fw \
  --model large-v3 \
  --vad true \
  --suppress_silence true \
  -o out.srt

# Python equivalent
import stable_whisper
model = stable_whisper.load_faster_whisper("large-v3")
result = model.transcribe(
    "input.mp3",
    vad=True,
    suppress_silence=True,
    word_timestamps=True,
)
result.to_srt_vtt("out.srt", word_level=False)

3. Force-align an existing transcript

Use model.align() when you already have clean text and just need timestamps. Significantly faster than re-transcribing.

# Force-align plain text to audio at word level.
import stable_whisper

model = stable_whisper.load_model("base")

text = (
    "Machines thinking, breeding. "
    "You were to bear us a new, promised land."
)

result = model.align("audio.mp3", text, language="en")

# Inspect or export
for seg in result.segments:
    for w in seg.words:
        print(f"{w.start:6.2f}-{w.end:6.2f}  {w.word!r}")

result.to_srt_vtt("out.srt", word_level=True)   # word-highlight subtitles

Source: the Alignment section of the project README. For faster in-segment timing only, see align_words().

Features

Speaker diarization: No
Word-level timestamps: Yes
Streaming / real-time: No
Languages supported: 99
HIPAA eligible: No

stable-ts vs Whipscribe

Feature | stable-ts | Whipscribe
Category | Open source | Transcription APIs
Pricing | free | free beta
Speaker diarization | No | Yes
Word timestamps | Yes | Yes
Streaming | No | No
Languages | 99 | 99
Platforms | Linux, macOS, Windows, GPU | Web, API, MCP

Alternatives to stable-ts

Whipscribe is a managed faster-whisper + whisperX service for getting transcripts without running your own infrastructure: paste a URL or upload a file and the transcript comes back in seconds.