SeamlessM4T
Meta's foundation model for multilingual ASR, speech-to-speech translation, and text translation in one stack. Research-licensed (CC-BY-NC-4.0).
SeamlessM4T is Meta's unified speech + text translation model from the Seamless Communication project, described in arXiv:2308.11596. The flagship seamless-m4t-v2-large is a 2.3B-parameter checkpoint that runs five tasks in one model — S2ST (speech-to-speech), S2TT (speech-to-text), T2ST (text-to-speech), T2TT (text-to-text), and ASR — covering 101 languages on speech input, 96 on text, and 35 on speech output. Companion models SeamlessExpressive (prosody + voice preservation) and SeamlessStreaming (low-latency real-time) extend the same family.
Best for research, prototypes, and evaluation where a single model has to span ASR + translation + synthesis across long-tail languages — or where you want to compare against the Whisper family on multilingual S2TT. License warning: the M4T and Streaming checkpoints are CC-BY-NC-4.0 (non-commercial); SeamlessExpressive uses the separate Seamless License. Commercial deployment is not permitted without separate authorisation from Meta. For a permissive multilingual alternative use openai/whisper or faster-whisper. Install via pip install transformers torch torchaudio (HF path) or the fairseq2 source build.
What it is
SeamlessM4T is Meta's "do everything with speech" foundation model. Good for projects that need transcription + translation in the same pass, particularly for languages Whisper covers poorly. Check the license — the research license rules out commercial deployment without separate permission.
Watch out for: research license (non-commercial, and this applies to the v2 checkpoints as well as v1); heavier runtime than Whisper; docs assume fairseq2 familiarity.
Install / use
pip install transformers torch torchaudio # HF path; or build seamless_communication (fairseq2) from source
Pick a task · 5 modes in one model
Every mode below except SeamlessExpressive is selected at inference time on the same v2-large checkpoint: you pass the same model id to SeamlessM4Tv2Model.from_pretrained(...) and switch behaviour via generate(..., tgt_lang=..., generate_speech=...) or the --task flag on the m4t_predict CLI. SeamlessExpressive is a separate checkpoint. Links point at Meta's research surface, not direct weight downloads.
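The task switch described above can be sketched as a small lookup. This is an illustration only, not part of transformers: `TASKS` and `generate_kwargs` are hypothetical names, and the only real arguments are `tgt_lang` and `generate_speech`, as named in the text.

```python
from typing import Optional

# Hypothetical lookup: task -> (input modality, generate_speech value).
TASKS = {
    "asr": ("speech", False),
    "s2tt": ("speech", False),
    "s2st": ("speech", True),
    "t2tt": ("text", False),
    "t2st": ("text", True),
}


def generate_kwargs(task: str, src_lang: str, tgt_lang: Optional[str] = None) -> dict:
    """Return the keyword arguments to pass to model.generate() for a task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    _, speech_out = TASKS[task]
    if task == "asr":
        # ASR keeps the source language, so the target equals the source.
        if tgt_lang not in (None, src_lang):
            raise ValueError("ASR keeps the source language; use s2tt to translate")
        tgt_lang = src_lang
    elif tgt_lang is None:
        raise ValueError(f"{task} needs a target language")
    return {"tgt_lang": tgt_lang, "generate_speech": speech_out}
```

With a loaded model, the result would be splatted into the call, for example `model.generate(**inputs, **generate_kwargs("s2tt", "eng", "fra"))`.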
Transcribe or translate spoken audio into text. ASR mode keeps the source language; S2TT translates the speech into text in a different target language in one pass, with no separate ASR + MT cascade. At 101 languages, speech-input coverage is among the widest of any open speech translation model.
2.3B params · 101 speech-input languages · pass tgt_lang same as source for ASR, different for S2TT · no native diarization / word timestamps
Translate input audio into spoken audio in another language inside a single model, without a separate ASR + MT + TTS cascade. Output covers 35 speech-output languages: fewer than the input set, but among the largest open S2ST footprints available.
101 speech-in / 35 speech-out · waveform output via integrated vocoder · generate_speech=True on the HF API
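Because speech output covers fewer languages than speech input, one reasonable pattern is to fall back to text output (S2TT) when the target language has no speech output. The sketch below assumes the caller supplies the supported set; `pick_output_mode` and `SPEECH_OUT_SAMPLE` are hypothetical names, and the sample set is a small illustrative subset, not the real 35-language list.

```python
from typing import Dict, Iterable, Union

# Illustrative subset only, NOT the full 35-language speech-output list.
# Take the real list from the model card before relying on this.
SPEECH_OUT_SAMPLE = frozenset({"eng", "fra", "spa", "rus"})


def pick_output_mode(
    tgt_lang: str,
    speech_out_langs: Iterable[str] = SPEECH_OUT_SAMPLE,
) -> Dict[str, Union[str, bool]]:
    """Request speech output only when the target language supports it."""
    speech_ok = tgt_lang in set(speech_out_langs)
    # generate_speech=True asks for a waveform (S2ST);
    # generate_speech=False falls back to translated text (S2TT).
    return {"tgt_lang": tgt_lang, "generate_speech": speech_ok}
```

The returned dict can be passed straight to `model.generate(**inputs, **pick_output_mode("spa"))`.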
Take written input in one language and produce spoken audio in another. Useful when you already have a text source — captions, articles, prompts — and want narrated output in a target language without a separate TTS engine.
96 text-in / 35 speech-out · same vocoder path as S2ST · single-speaker output (no voice cloning here — use SeamlessExpressive for prosody / voice)
Plain text in, plain text out — the conventional MT mode. Competitive against NLLB and other open MT baselines on the same language pairs, and useful as a sanity-check head when debugging S2TT quality.
96 text-in / 96 text-out · generate_speech=False · same checkpoint as the speech tasks
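For text input on the Hugging Face path, the processor takes `text=` and `src_lang=` instead of audio. The sketch below follows the pattern in the transformers documentation for SeamlessM4Tv2; `translate_text` and `load_v2_large` are hypothetical wrapper names, and the model and processor are passed in so that the heavy download happens only when you ask for it.

```python
def translate_text(text, src_lang, tgt_lang, model, processor):
    """T2TT: translate text from src_lang into tgt_lang and return a string."""
    # Text input needs src_lang on the processor call.
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # generate_speech=False returns text tokens instead of a waveform.
    tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(tokens[0].tolist()[0], skip_special_tokens=True)


def load_v2_large():
    """Load the v2-large model and processor (large download on first use)."""
    # Imported lazily so translate_text can be used without transformers
    # installed, for example with stub objects in a unit test.
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    model_id = "facebook/seamless-m4t-v2-large"
    model = SeamlessM4Tv2Model.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor


# Usage (downloads the checkpoint on first run):
#   model, processor = load_v2_large()
#   print(translate_text("Bonjour, comment ça va ?", "fra", "eng", model, processor))
```

Switching `generate_speech` to its default of `True` on the same inputs would give the T2ST path, which returns a waveform rather than tokens.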
A separate checkpoint that preserves the speaker's voice, rhythm, and emotion when translating speech-to-speech. Smaller language coverage than M4T, and the license is the bespoke Seamless License rather than CC-BY-NC-4.0 — review terms before any deployment.
Voice + emotion preservation · narrower language matrix than v2-large · gated download on HF · Seamless License (non-commercial-leaning, read full text)
Setup recipes · pick one and copy
Three runnable paths. Recipe 1 is the easiest — HF transformers loads the v2 checkpoint with no extra build. Recipe 2 is the official fairseq2 source path Meta uses internally. Recipe 3 skips installation entirely and runs the model on Meta's hosted HF Space.
pip install transformers torch torchaudio · SeamlessM4Tv2Model + AutoProcessor. Same code handles every task — switch by changing tgt_lang and generate_speech.
# SeamlessM4T v2 via Hugging Face transformers
pip install --upgrade transformers torch torchaudio sentencepiece  # sentencepiece backs the tokenizer
# infer.py
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model
model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2Model.from_pretrained(model_id)
# ── Speech-to-text translation (S2TT) — English audio -> French text ──
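audio, sr = torchaudio.load("meeting.wav")  # shape: (channels, samples)
audio = audio.mean(dim=0, keepdim=True)  # downmix to mono so stereo is not read as a batch of two
audio = torchaudio.functional.resample(audio, sr, 16_000)  # the model expects 16 kHz input
inputs = processor(audios=audio, sampling_rate=16_000, return_tensors="pt")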
text = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(text[0].tolist()[0], skip_special_tokens=True))
# ── ASR (transcribe in source language) ──
asr = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(asr[0].tolist()[0], skip_special_tokens=True))
# ── Speech-to-speech translation (S2ST) — returns a waveform ──
wav = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
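The S2ST line above leaves the waveform in memory as a NumPy array. A minimal way to write it to disk with only the standard library is sketched below, assuming mono float samples in the range [-1.0, 1.0]. `save_wav` is a hypothetical helper; the transformers docs read the output rate from `model.config.sampling_rate`, and 16 kHz is used here as the default.

```python
import wave


def save_wav(samples, path, sample_rate=16_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += int(clipped * 32767).to_bytes(2, "little", signed=True)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Usage with the array from the recipe above:
#   save_wav(wav, "translated.wav", sample_rate=model.config.sampling_rate)
```

The per-sample loop is fine for short clips; for long audio a vectorised writer such as `scipy.io.wavfile.write` is likely faster.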
The path Meta uses in the paper — clone the repo, install in editable mode, run the m4t_predict CLI. Heavier setup, but you get the full Seamless toolchain (expressive, streaming, eval scripts).
# fairseq2 source install · official Meta CLI
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication
# fairseq2 wheel is published per torch / cuda; check the README for your combo
pip install -e .
# ── Speech-to-text translation: English wav -> French text ──
m4t_predict input.wav \
--task s2tt \
--src_lang eng \
--tgt_lang fra \
--model_name seamlessM4T_v2_large
# ── Speech-to-speech: English wav -> Spanish wav ──
m4t_predict input.wav \
--task s2st \
--src_lang eng \
--tgt_lang spa \
--model_name seamlessM4T_v2_large \
--output_path translated.wav
# ── Text-to-text: French text -> English text ──
m4t_predict "Bonjour, comment ça va?" \
--task t2tt \
--src_lang fra \
--tgt_lang eng \
--model_name seamlessM4T_v2_large
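When the CLI is driven from a script, building the argument list in one place avoids shell-quoting mistakes. The sketch below uses only the flags shown in the commands above. `m4t_predict_argv` and `run_m4t_predict` are hypothetical helpers, and requiring `--output_path` for the speech-output tasks is an assumption drawn from the S2ST example, not a documented rule.

```python
import subprocess
from typing import List, Optional


def m4t_predict_argv(
    source: str,
    task: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
    output_path: Optional[str] = None,
    model_name: str = "seamlessM4T_v2_large",
) -> List[str]:
    """Build the m4t_predict command line as a list of arguments."""
    # Assumption: tasks that emit audio need somewhere to write it.
    if task in ("s2st", "t2st") and output_path is None:
        raise ValueError(f"{task} produces audio; pass output_path")
    argv = ["m4t_predict", source, "--task", task,
            "--tgt_lang", tgt_lang, "--model_name", model_name]
    if src_lang:
        argv += ["--src_lang", src_lang]
    if output_path:
        argv += ["--output_path", output_path]
    return argv


def run_m4t_predict(*args, **kwargs) -> None:
    """Run the CLI and raise CalledProcessError if it exits non-zero."""
    subprocess.run(m4t_predict_argv(*args, **kwargs), check=True)


# Usage (requires the seamless_communication install from this recipe):
#   run_m4t_predict("input.wav", "s2st", "spa", src_lang="eng",
#                   output_path="translated.wav")
```

Passing a list to `subprocess.run` bypasses the shell, so text inputs containing quotes or accents need no escaping.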
Meta hosts an official Gradio Space — paste a sentence or upload a short clip and inspect the output before committing to a local install. Good for license-conscious evaluation since no weights are downloaded.
# No code path — open in a browser:
#
# https://huggingface.co/spaces/facebook/seamless_m4t
#
# The Space runs SeamlessM4T v2 with a UI for:
# - S2ST (mic / file -> translated audio)
# - S2TT (mic / file -> translated text)
# - T2ST (text -> translated audio)
# - T2TT (text -> translated text)
#
# Tip: the Space surface is rate-limited and occasionally offline
# while Meta rebuilds dependencies. If the demo errors, the next-best
# zero-install option is the companion notebook on the repo:
# https://github.com/facebookresearch/seamless_communication/tree/main/demo
Features
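| Feature | SeamlessM4T |
|---|---|
| Speaker diarization | No |
| Word-level timestamps | No |
| Streaming / real-time | Yes (via the SeamlessStreaming companion model) |
| Languages supported | 101 speech-in · 96 text · 35 speech-out |
| HIPAA eligible | No |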
Links
- facebookresearch/seamless_communication: main repo · M4T, Expressive, Streaming, eval scripts, m4t_predict CLI
- Meta AI · Seamless Communication research page: project landing · blog posts, capabilities overview, demo videos
- facebook/seamless-m4t-v2-large model card: flagship 2.3B checkpoint · 101 speech-in / 96 text / 35 speech-out languages · CC-BY-NC-4.0
- SeamlessM4T paper · arXiv:2308.11596: Seamless Communication, Barrault et al. (2023) · architecture, training data, eval
- huggingface.co/spaces/facebook/seamless_m4t: official Gradio demo · all 4 tasks in-browser, no install
- facebook/seamless-expressive: companion model · preserves voice + prosody · separate Seamless License
- facebook/seamless-streaming: companion model · low-latency real-time S2TT and S2ST · CC-BY-NC-4.0
- transformers docs · SeamlessM4Tv2: API reference · SeamlessM4Tv2Model, AutoProcessor, every generate-time argument
- CC-BY-NC-4.0 license: the non-commercial license that gates SeamlessM4T and SeamlessStreaming weights
SeamlessM4T vs Whipscribe
| Feature | SeamlessM4T | Whipscribe |
|---|---|---|
| Category | Open source | Transcription APIs |
| Pricing | free | free beta |
| Speaker diarization | No | Yes |
| Word timestamps | No | Yes |
| Streaming | Yes (SeamlessStreaming) | No |
| Languages | 101 speech-in / 96 text / 35 speech-out | 99 |
| Platforms | Linux, GPU | Web, API, MCP |
Alternatives to SeamlessM4T
Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.