SeamlessM4T

by Meta AI

Meta's foundation model for multilingual ASR, speech-to-speech translation, and text translation in one stack. Research-licensed (CC-BY-NC-4.0).

TL;DR

SeamlessM4T is Meta's unified speech + text translation model from the Seamless Communication project, described in arXiv:2308.11596. The flagship seamless-m4t-v2-large is a 2.3B-parameter checkpoint that runs five tasks in one model: S2ST (speech-to-speech translation), S2TT (speech-to-text translation), T2ST (text-to-speech translation), T2TT (text-to-text translation), and ASR (same-language transcription). It covers 101 languages for speech input, 96 for text, and 35 for speech output. Companion models SeamlessExpressive (prosody + voice preservation) and SeamlessStreaming (low-latency real-time) extend the same family.

Best for research, prototypes, and evaluation where a single model has to span ASR + translation + synthesis across long-tail languages, or where you want to compare against the Whisper family on multilingual S2TT.

License warning: the M4T and Streaming checkpoints are CC-BY-NC-4.0 (non-commercial); SeamlessExpressive uses the separate Seamless License. Commercial deployment is not permitted without separate authorisation from Meta. For a permissive multilingual alternative use openai/whisper or faster-whisper.

Install via pip install transformers torch torchaudio (HF path) or the fairseq2 source build.

Category: Open source
License: CC-BY-NC-4.0 for the M4T and Streaming weights (repository metadata reports NOASSERTION)
Stars: 11.8k
Last push: 2026-04-08
Pricing: free
Platforms: Linux, GPU

What it is

SeamlessM4T is Meta's "do everything with speech" foundation model. Good for projects that need transcription + translation in the same pass, particularly for languages Whisper covers poorly. Check the license — the research license rules out commercial deployment without separate permission.

Best for: Cross-lingual transcription and translation in one model, especially low-resource languages.
Watch out for: Research license (non-commercial, for both the v1 and v2 checkpoints); heavier runtime than Whisper; docs assume fairseq2 familiarity.

Install / use

pip install transformers torch torchaudio  # HF path; for the fairseq2 path, build seamless_communication from source

Pick a task · 5 modes in one model

Every mode below is selected at inference time on the same v2-large checkpoint — you pass the same model id to SeamlessM4Tv2Model.from_pretrained(...) and switch behaviour via generate(..., tgt_lang=..., generate_speech=...) or the --task flag on the m4t_predict CLI. Links point at Meta's research surface, not direct weight downloads.
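
The task switch described above can be sketched as a small lookup table. This is an illustrative helper only: `TASKS` and `generate_kwargs` are hypothetical names, not part of transformers, and the mapping simply restates the task descriptions on this page.

```python
# Hypothetical helper: maps the five task names to generate() keyword arguments.
# The table restates this page's task descriptions; it is not a transformers API.
TASKS = {
    "s2st": {"input": "speech", "generate_speech": True},
    "s2tt": {"input": "speech", "generate_speech": False},
    "asr":  {"input": "speech", "generate_speech": False},
    "t2st": {"input": "text",   "generate_speech": True},
    "t2tt": {"input": "text",   "generate_speech": False},
}


def generate_kwargs(task, src_lang, tgt_lang):
    """Return the keyword arguments to pass to model.generate() for a task."""
    task = task.lower()
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    # ASR is S2TT with the target language equal to the source language.
    if task == "asr" and src_lang != tgt_lang:
        raise ValueError("ASR keeps the source language: tgt_lang must equal src_lang")
    return {"tgt_lang": tgt_lang, "generate_speech": TASKS[task]["generate_speech"]}


if __name__ == "__main__":
    print(generate_kwargs("s2tt", "eng", "fra"))
    print(generate_kwargs("s2st", "eng", "spa"))
```

With a loaded model, the result would be splatted into the call, for example model.generate(**inputs, **generate_kwargs("s2tt", "eng", "fra")).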

Speech-to-text (S2TT + ASR)
Multilingual ASR · 101 input languages

Transcribe or translate spoken audio into text. ASR mode keeps the source language; S2TT translates the speech into text in a different target language in one pass, with no separate ASR + MT cascade. At 101 speech-input languages, its input coverage is among the widest of the open speech translation models.

facebook/seamless-m4t-v2-large · task=asr or s2tt
2.3B params · 101 speech-input languages · pass tgt_lang same as source for ASR, different for S2TT · no native diarization / word timestamps

Speech-to-speech (S2ST)
End-to-end speech translation

Translate input audio into spoken audio in another language with a single model rather than a chain of separate ASR, MT, and TTS systems. Output covers 35 speech-output languages: fewer than the input set, but one of the larger open S2ST footprints available.

facebook/seamless-m4t-v2-large · task=s2st
101 speech-in / 35 speech-out · waveform output via integrated vocoder · generate_speech=True on the HF API
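
The waveform that S2ST returns still has to be written to disk. Below is a minimal sketch using only the Python standard library. `save_wav` is a hypothetical helper, and the 16 kHz default is an assumption: in practice the output rate should be read from the loaded model's configuration rather than hard-coded.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sample_rate=16_000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Returns the number of frames written. sample_rate=16_000 is an assumed
    default, not a value confirmed for every checkpoint.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return len(frames) // 2


if __name__ == "__main__":
    # Stand-in for model output: one second of a 440 Hz tone.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
    out_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
    print(save_wav(tone, out_path), "frames written to", out_path)
```

The function iterates over any sequence of floats, so a 1-D NumPy array such as the squeezed S2ST output can be passed in directly.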

Text-to-speech (T2ST)
Translated speech synthesis

Take written input in one language and produce spoken audio in another. Useful when you already have a text source — captions, articles, prompts — and want narrated output in a target language without a separate TTS engine.

facebook/seamless-m4t-v2-large · task=t2st
96 text-in / 35 speech-out · same vocoder path as S2ST · single-speaker output (no voice cloning here — use SeamlessExpressive for prosody / voice)

Text-to-text (T2TT)
Multilingual machine translation

Plain text in, plain text out — the conventional MT mode. Competitive against NLLB and other open MT baselines on the same language pairs, and useful as a sanity-check head when debugging S2TT quality.

facebook/seamless-m4t-v2-large · task=t2tt
96 text-in / 96 text-out · generate_speech=False · same checkpoint as the speech tasks
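
A minimal sketch of the T2TT call shape, assuming the Hugging Face processor accepts text= and src_lang= and that generate() returns token ids when generate_speech=False. `translate_text` is a hypothetical wrapper; the model and processor are passed in so the function itself loads nothing.

```python
def translate_text(text, src_lang, tgt_lang, model, processor):
    """Translate text to text with an already-loaded SeamlessM4T v2 model.

    `model` and `processor` are expected to come from
    SeamlessM4Tv2Model.from_pretrained(...) and AutoProcessor.from_pretrained(...).
    """
    # Tokenise the source text and tag it with its language code.
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # generate_speech=False asks for token ids instead of a waveform.
    tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(tokens[0].tolist()[0], skip_special_tokens=True)


# Usage, assuming the checkpoint is already loaded as in the setup recipes:
#   print(translate_text("Bonjour, comment ça va ?", "fra", "eng", model, processor))
```

Passing generate_speech=True with the same text inputs would be the T2ST variant, which returns audio rather than tokens.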

Expressive (voice + prosody)
Companion model · SeamlessExpressive

A separate checkpoint that preserves the speaker's voice, rhythm, and emotion when translating speech-to-speech. Smaller language coverage than M4T, and the license is the bespoke Seamless License rather than CC-BY-NC-4.0 — review terms before any deployment.

facebook/seamless-expressive · separate checkpoint, separate license
Voice + emotion preservation · narrower language matrix than v2-large · gated download on HF · Seamless License (non-commercial-leaning, read full text)

License warning · read before any production use: the SeamlessM4T and SeamlessStreaming checkpoints are released under CC-BY-NC-4.0, and the NC clause forbids commercial use. SeamlessExpressive uses the separate Seamless License. Only the surrounding code and the W2v-BERT 2.0 encoder are MIT. If you need commercial multilingual ASR, switch to openai/whisper (MIT) or faster-whisper (MIT). For managed multilingual transcription without the licensing question, Whipscribe handles it for you.

Setup recipes · pick one and copy

Three runnable paths. Recipe 1 is the easiest — HF transformers loads the v2 checkpoint with no extra build. Recipe 2 is the official fairseq2 source path Meta uses internally. Recipe 3 skips installation entirely and runs the model on Meta's hosted HF Space.

1. HF transformers (easiest)

pip install transformers torch torchaudio · SeamlessM4Tv2Model + AutoProcessor. Same code handles every task — switch by changing tgt_lang and generate_speech.

# SeamlessM4T v2 via Hugging Face transformers
# Shell step (run once): pip install --upgrade transformers torch torchaudio
# Assumption: the tokenizer may also need `pip install sentencepiece`

# infer.py
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

model_id  = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model     = SeamlessM4Tv2Model.from_pretrained(model_id)

# Load the clip and resample it to the 16 kHz rate the model expects
audio, sr = torchaudio.load("meeting.wav")
audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16_000)
inputs = processor(audios=audio, sampling_rate=16_000, return_tensors="pt")

# ── Speech-to-text translation (S2TT): English audio -> French text ──
text = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(text[0].tolist()[0], skip_special_tokens=True))

# ── ASR: tgt_lang matches the spoken language, so the output is a transcript ──
asr = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(asr[0].tolist()[0], skip_special_tokens=True))

# ── Speech-to-speech translation (S2ST): returns a waveform ──
# generate_speech defaults to True; the output rate is model.config.sampling_rate
wav = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

2. fairseq2 + checkpoint (official path)

The path Meta uses in the paper — clone the repo, install in editable mode, run the m4t_predict CLI. Heavier setup, but you get the full Seamless toolchain (expressive, streaming, eval scripts).

# fairseq2 source install · official Meta CLI
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication

# fairseq2 wheel is published per torch / cuda; check the README for your combo
pip install -e .

# ── Speech-to-text translation: English wav -> French text ──
m4t_predict input.wav \
    --task s2tt \
    --src_lang eng \
    --tgt_lang fra \
    --model_name seamlessM4T_v2_large

# ── Speech-to-speech: English wav -> Spanish wav ──
m4t_predict input.wav \
    --task s2st \
    --src_lang eng \
    --tgt_lang spa \
    --model_name seamlessM4T_v2_large \
    --output_path translated.wav

# ── Text-to-text: French text -> English text ──
m4t_predict "Bonjour, comment ça va ?" \
    --task t2tt \
    --src_lang fra \
    --tgt_lang eng \
    --model_name seamlessM4T_v2_large
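
For batch jobs, the CLI calls above can be assembled from Python. The sketch below only builds the argument list, using the flags shown in this recipe. `build_m4t_predict_cmd` is a hypothetical helper, and the rule that speech-output tasks need --output_path is an assumption, since the recipe only shows that flag on the s2st call.

```python
import shlex

SPEECH_OUTPUT_TASKS = ("s2st", "t2st")  # tasks that produce a waveform


def build_m4t_predict_cmd(source, task, tgt_lang, src_lang=None,
                          output_path=None, model_name="seamlessM4T_v2_large"):
    """Build an argv list for m4t_predict (hypothetical wrapper)."""
    # Assumption: speech-output tasks need somewhere to save the audio.
    if task in SPEECH_OUTPUT_TASKS and not output_path:
        raise ValueError(f"task {task!r} produces audio; pass output_path")
    cmd = ["m4t_predict", source, "--task", task]
    if src_lang:
        cmd += ["--src_lang", src_lang]
    cmd += ["--tgt_lang", tgt_lang, "--model_name", model_name]
    if output_path:
        cmd += ["--output_path", output_path]
    return cmd


if __name__ == "__main__":
    cmd = build_m4t_predict_cmd("input.wav", "s2st", "spa",
                                src_lang="eng", output_path="translated.wav")
    # shlex.join quotes arguments safely for display or logging.
    print(shlex.join(cmd))
    # To execute: subprocess.run(cmd, check=True), once the package is installed.
```

Returning a list rather than a shell string means text inputs with spaces or quotes need no manual escaping when the command is handed to subprocess.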

3. Demo via HF Spaces (no install)

Meta hosts an official Gradio Space — paste a sentence or upload a short clip and inspect the output before committing to a local install. Good for license-conscious evaluation since no weights are downloaded.

# No code path — open in a browser:
#
#   https://huggingface.co/spaces/facebook/seamless_m4t
#
# The Space runs SeamlessM4T v2 with a UI for:
#   - S2ST  (mic / file -> translated audio)
#   - S2TT  (mic / file -> translated text)
#   - T2ST  (text -> translated audio)
#   - T2TT  (text -> translated text)
#
# Tip: the Space surface is rate-limited and occasionally offline
# while Meta rebuilds dependencies. If the demo errors, the next-best
# zero-install option is the companion notebook on the repo:
#   https://github.com/facebookresearch/seamless_communication/tree/main/demo

Features

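Speaker diarization: No
Word-level timestamps: No
Streaming / real-time: Yes (via the SeamlessStreaming companion model)
Languages supported: 100 (approximate; 101 speech-input, 96 text, 35 speech-output)
HIPAA eligible: No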

SeamlessM4T vs Whipscribe

Feature             | SeamlessM4T | Whipscribe
Category            | Open source | Transcription APIs
Pricing             | free        | free beta
Speaker diarization | No          | Yes
Word timestamps     | No          | Yes
Streaming           | Yes         | No
Languages           | 100         | 99
Platforms           | Linux, GPU  | Web, API, MCP

Alternatives to SeamlessM4T

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.