stable-ts

Name: stable-ts
Author: jianfch

Word-perfect timestamps for SRT/VTT subtitles on top of any Whisper backend — vanilla Whisper, faster-whisper, HF Transformers, or MLX.

TL;DR

MIT-licensed Python library + CLI that wraps OpenAI Whisper, faster-whisper, Hugging Face Transformers, and MLX Whisper to produce stabilised word-level timestamps. Re-runs the decode with a tuned segmentation strategy, then applies silence suppression (Silero VAD or custom denoiser), refinement (token-probability re-scoring), and regrouping rules to land cleaner SRT/VTT/ASS timing than the raw Whisper output. Latest stable: 2.19.1 on PyPI, Python ≥3.8.

Best for subtitle creators, video editors, and accessibility teams who need karaoke-quality word timing, force-alignment of existing transcripts to fresh audio, or both. Pair with faster-whisper for the throughput sweet spot or with MLX on Apple Silicon.

What it is

stable-ts tackles Whisper's weakest point, word-level timestamp drift, by re-running inference with a tuned segmentation strategy. The result: noticeably cleaner SRT/VTT output for subtitles and shorts. Pairs well with faster-whisper or MLX Whisper as the underlying engine.

Best for: Subtitle generation where drift and poor word boundaries matter.
Watch out for: Slower than faster-whisper when run on the default vanilla Whisper backend; no diarization; less community momentum than WhisperX.

Install / use

pip install stable-ts

Where stable-ts shines · use-case cards

stable-ts solves a small set of problems very well. Each card below maps a concrete subtitle / alignment workflow to the right flag or API entry point. Links point to the canonical section of the project README.

SRT / VTT for video

Output formats · subtitle-grade timing

The default reason to reach for stable-ts. One command produces a polished SRT/VTT/ASS file with stabilised segment and word timing, typically with less timestamp drift than vanilla Whisper at the same word error rate (WER).

CLI: stable-ts audio.mp3 -o audio.srt · also -o .vtt / .ass / .tsv / .json
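If you post-process these files yourself, the main difference between an SRT and a VTT timestamp field is the decimal separator: SRT uses a comma, WebVTT a dot (a VTT file also starts with a WEBVTT header line). A minimal sketch of that formatting, not stable-ts's own writer; format_timestamp and srt_cue are hypothetical helper names:

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Format seconds as HH:MM:SS<marker>mmm (',' for SRT, '.' for WebVTT)."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one numbered SRT cue: index line, time range line, text line."""
    time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
    return f"{index}\n{time_range}\n{text}\n"


if __name__ == "__main__":
    print(srt_cue(1, 0.0, 1.25, "Hello"))
    print(format_timestamp(1.25, decimal_marker="."))  # WebVTT style
```

In practice the -o flag handles all of this; the sketch is only useful when the JSON output feeds a custom pipeline.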

Karaoke / word-highlighted captions

Word-level timestamps for highlight rendering

Per-word start/end timestamps are on by default. Export to ASS for karaoke-style word highlights in the same pass, or pull the JSON output and drive your own caption renderer.

Methods: result.to_ass(...) · result.to_srt_vtt(word_level=True) · result.to_dict()
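To drive your own caption renderer from the dictionary output, one cue per word with the active word tagged is usually enough. The sketch below assumes the dictionary has a "segments" list whose items carry a "words" list with "word", "start", and "end" keys; check that shape against your actual result.to_dict() output. highlight_cues and the underline tags are hypothetical choices, not part of stable-ts:

```python
from typing import Dict, Iterator, Tuple


def highlight_cues(
    result: Dict, tag: Tuple[str, str] = ("<u>", "</u>")
) -> Iterator[Tuple[float, float, str]]:
    """Yield (start, end, line) per word, with that word wrapped in `tag`."""
    open_tag, close_tag = tag
    for segment in result.get("segments", []):
        words = segment.get("words", [])
        # Whisper-style word strings often carry a leading space; strip it.
        tokens = [w["word"].strip() for w in words]
        for i, word in enumerate(words):
            # Joining with spaces suits space-delimited languages only.
            line = " ".join(
                f"{open_tag}{t}{close_tag}" if j == i else t
                for j, t in enumerate(tokens)
            )
            yield word["start"], word["end"], line


if __name__ == "__main__":
    sample = {
        "segments": [
            {
                "words": [
                    {"word": " Hello", "start": 0.0, "end": 0.4},
                    {"word": " world", "start": 0.5, "end": 0.9},
                ]
            }
        ]
    }
    for start, end, line in highlight_cues(sample):
        print(f"{start:.2f}-{end:.2f}  {line}")
```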

Align an existing transcript

Force-align text → audio

Have a clean human transcript but no timing? align() pins the words to the audio without re-running full transcription — much faster than transcribing and ideal for iterating on segmentation rules. Realign a previous result the same way.

Python: model.align(audio, text, language='en') · also align_words(...) for fast in-segment word timing
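Once words have timestamps, a segmentation rule is just a way of deciding where one subtitle ends and the next begins. The sketch below is a simplified illustration of two common rules (break on a long pause, break after sentence-ending punctuation) applied to plain word dictionaries. It does not call stable-ts's own regrouping methods; regroup, max_gap, and end_punct are hypothetical names:

```python
from typing import Dict, List


def regroup(
    words: List[Dict], max_gap: float = 0.5, end_punct: str = ".?!"
) -> List[List[Dict]]:
    """Split timed words into groups on long pauses and sentence endings."""
    groups: List[List[Dict]] = []
    current: List[Dict] = []
    for word in words:
        # Rule 1: a pause longer than max_gap starts a new group.
        if current and word["start"] - current[-1]["end"] > max_gap:
            groups.append(current)
            current = []
        current.append(word)
        # Rule 2: sentence-ending punctuation closes the current group.
        if word["word"].rstrip().endswith(tuple(end_punct)):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    words = [
        {"word": " Machines", "start": 0.0, "end": 0.5},
        {"word": " thinking,", "start": 0.5, "end": 1.0},
        {"word": " breeding.", "start": 1.0, "end": 1.6},
        {"word": " You", "start": 1.7, "end": 1.9},
        {"word": " were", "start": 3.0, "end": 3.2},
    ]
    for group in regroup(words):
        print("".join(w["word"] for w in group).strip())
```

Because alignment is quick to re-run, thresholds like max_gap can be tuned against real audio without paying for a full transcription each time.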

Backend of choice

Whisper · faster-whisper · HF · MLX

Same stabilisation pipeline on top of any backend. Default is vanilla Whisper; switch with a single CLI flag or by loading a different model class in Python. Pick by hardware: faster-whisper on NVIDIA, MLX on Apple Silicon, HF Transformers for non-standard checkpoints.

CLI flags: -fw faster-whisper · -hw Hugging Face · -mlx MLX Whisper

Silence suppression / VAD

Silero VAD + denoiser preprocessing

Whisper's biggest timestamp failure mode is hallucinated text over silence and word boundaries that snap to silence edges. suppress_silence moves timestamps off detected silence and onto voiced regions; --vad switches the silence detector from the default volume-based method to Silero VAD.

Flags: --vad (Silero) · --suppress_silence · --denoiser · --q_levels · --k_size
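The idea behind snapping timestamps to voiced regions can be sketched in a few lines. This is a deliberately simplified illustration, not stable-ts's algorithm: it assumes the silent intervals are already known (from a VAD or a volume threshold) and only trims word edges that fall inside them. snap_to_speech is a hypothetical name:

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def snap_to_speech(start: float, end: float, silences: List[Interval]) -> Interval:
    """Trim a word's (start, end) so neither edge lies inside a silent interval."""
    for s_start, s_end in silences:
        # Word starts inside silence: push the start to where the silence ends.
        if s_start <= start < s_end:
            start = min(s_end, end)
        # Word ends inside silence: pull the end back to where the silence begins.
        if s_start < end <= s_end:
            end = max(s_start, start)
    return start, end


if __name__ == "__main__":
    silences = [(0.8, 1.2), (1.8, 2.5)]
    print(snap_to_speech(1.0, 2.0, silences))
```

A word that lies entirely inside one silent interval collapses to zero duration here; a real implementation needs a policy for that case, such as a minimum word duration.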

Refinement + word locating

Token re-scoring · quick word search

Refinement iteratively mutes spans of audio to find the true word boundaries via token-probability shifts — slower than a single decode but the closest thing to ground-truth timing without forced alignment. locate() searches an audio file for when specific words are spoken without full transcription.

Python: model.refine(audio, result) · model.locate(audio, 'word', language='en')

Selecting at runtime: the CLI takes the backend flag (-fw / -hw / -mlx); the Python API exposes stable_whisper.load_model, load_faster_whisper, load_hf_whisper, and load_mlx_whisper. Pick whichever matches your hardware and call the same .transcribe() / .align() / .refine() methods on the loaded model. For raw throughput see faster-whisper; for diarization on top of word timing see WhisperX.
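In a script that has to run on mixed hardware, the backend choice can be reduced to a lookup. A minimal sketch, assuming the loader names used in the faster-whisper recipe on this page and that each loader accepts a model name as its first argument (model identifiers may differ per backend). The backend keys, loader_name, and load_backend are hypothetical:

```python
import importlib

# Hypothetical backend labels mapped to stable_whisper loader function names.
LOADERS = {
    "whisper": "load_model",
    "faster-whisper": "load_faster_whisper",
    "hf": "load_hf_whisper",
    "mlx": "load_mlx_whisper",
}


def loader_name(backend: str) -> str:
    """Return the loader function name for a backend label."""
    try:
        return LOADERS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(LOADERS)}"
        ) from None


def load_backend(backend: str, model_name: str = "base"):
    """Import stable_whisper lazily and load a model with the chosen backend."""
    stable_whisper = importlib.import_module("stable_whisper")
    return getattr(stable_whisper, loader_name(backend))(model_name)


if __name__ == "__main__":
    print(loader_name("faster-whisper"))
```

The lazy import keeps the lookup usable on machines where stable-ts, or the extra for a given backend, is not installed.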

Setup recipes · pick one and copy

Three runnable configurations covering the most common stable-ts deployments. Each block is copy-and-run against stable-ts 2.19.x.

1. Install + first subtitle

pip install -U stable-ts · CLI · one-line SRT export. The default pick.

# stable-ts 2.19.x · Python >=3.8
pip install -U stable-ts

# Transcribe a video / audio file and write SRT in one command.
stable-ts input.mp4 -o out.srt

# Other output formats:
#   stable-ts input.mp4 -o out.vtt
#   stable-ts input.mp4 -o out.ass   # karaoke-friendly word highlights
#   stable-ts input.mp4 -o out.json  # raw timings for custom renderers

# Pick a Whisper model size with --model (default: base)
stable-ts input.mp4 --model large-v3 -o out.srt

Source: jianfch/stable-ts · Usage. PyPI: stable-ts 2.19.1

2. Backend = faster-whisper (up to 4× speedup)

Add the [fw] extra, pass -fw, point at large-v3 on a GPU. Production sweet spot when you also want stable-ts post-processing.

# stable-ts + faster-whisper backend
pip install -U "stable-ts[fw]"

# CLI: -fw routes the decode through faster-whisper (CTranslate2).
stable-ts input.mp3 \
  -fw \
  --model large-v3 \
  --vad true \
  --suppress_silence true \
  -o out.srt

# Python equivalent (run in a Python session, not in the shell)
import stable_whisper
model = stable_whisper.load_faster_whisper("large-v3")
result = model.transcribe(
    "input.mp3",
    vad=True,
    suppress_silence=True,
    word_timestamps=True,
)
result.to_srt_vtt("out.srt", word_level=False)

Source: Setup + Silence suppression. Backend repo: SYSTRAN/faster-whisper

3. Force-align an existing transcript

Use model.align() when you already have clean text and just need timestamps. Significantly faster than re-transcribing.

# Force-align plain text to audio at word level.
import stable_whisper

model = stable_whisper.load_model("base")

text = (
    "Machines thinking, breeding. "
    "You were to bear us a new, promised land."
)

result = model.align("audio.mp3", text, language="en")

# Inspect or export
for seg in result.segments:
    for w in seg.words:
        print(f"{w.start:6.2f}-{w.end:6.2f}  {w.word!r}")

result.to_srt_vtt("out.srt", word_level=True)   # word-highlight subtitles

Source: Alignment. For faster in-segment timing only, see align_words()

Features

Speaker diarization	No
Word-level timestamps	Yes
Streaming / real-time	No
Languages supported	99
HIPAA eligible	No

stable-ts vs Whipscribe

Feature	stable-ts	Whipscribe
Category	Open source	Transcription APIs
Pricing	free	free beta
Speaker diarization	No	Yes
Word timestamps	Yes	Yes
Streaming	No	No
Languages	99	99
Platforms	Linux, macOS, Windows, GPU	Web, API, MCP

Alternatives to stable-ts

OpenAI Whisper

OpenAI

The reference open-source multilingual ASR model from OpenAI.

OSS · MIT ★ 98.1k

whisper.cpp

Georgi Gerganov

C/C++ port of Whisper — runs on anything, from a Raspberry Pi to Apple Silicon.

OSS · MIT ★ 48.8k

faster-whisper

SYSTRAN

Up to 4× faster than reference Whisper using CTranslate2; the production sweet spot.

OSS · MIT ★ 22.3k

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.