SeamlessM4T

Name: SeamlessM4T
Author: Meta AI

Meta's foundation model for multilingual ASR, speech-to-speech translation, and text translation in one stack. Research-licensed (CC-BY-NC-4.0).

TL;DR

SeamlessM4T is Meta's unified speech + text translation model from the Seamless Communication project, described in arXiv:2308.11596. The flagship seamless-m4t-v2-large is a 2.3B-parameter checkpoint that runs five tasks in one model: S2ST (speech-to-speech translation), S2TT (speech-to-text translation), T2ST (text-to-speech translation), T2TT (text-to-text translation), and ASR. It covers 101 languages for speech input, 96 for text, and 35 for speech output. Companion models SeamlessExpressive (prosody + voice preservation) and SeamlessStreaming (low-latency real-time) extend the same family.

Best for research, prototypes, and evaluation where a single model has to span ASR + translation + synthesis across long-tail languages — or where you want to compare against the Whisper family on multilingual S2TT. License warning: the M4T and Streaming checkpoints are CC-BY-NC-4.0 (non-commercial); SeamlessExpressive uses the separate Seamless License. Commercial deployment is not permitted without separate authorisation from Meta. For a permissive multilingual alternative use openai/whisper or faster-whisper. Install via pip install transformers torch torchaudio (HF path) or the fairseq2 source build.

What it is

SeamlessM4T is Meta's "do everything with speech" foundation model. Good for projects that need transcription + translation in the same pass, particularly for languages Whisper covers poorly. Check the license — the research license rules out commercial deployment without separate permission.

Best for: Cross-lingual transcription and translation in one model, especially low-resource languages.
Watch out for: Research license (CC-BY-NC-4.0, non-commercial); heavier runtime than Whisper; the official docs assume fairseq2 familiarity.

Install / use

pip install transformers torch torchaudio  # HF path; then load facebook/seamless-m4t-v2-large

Pick a task · 5 modes in one model

Every mode below is selected at inference time on the same v2-large checkpoint — you pass the same model id to SeamlessM4Tv2Model.from_pretrained(...) and switch behaviour via generate(..., tgt_lang=..., generate_speech=...) or the --task flag on the m4t_predict CLI. Links point at Meta's research surface, not direct weight downloads.
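The task switch described above can be sketched as a small lookup. This is an illustrative helper, not part of transformers or seamless_communication: the name generate_kwargs and the task table are assumptions made for this example, and only tgt_lang and generate_speech come from the HF API described above.

```python
from typing import Optional

# Hypothetical table: which of the five tasks produce speech output.
SPEECH_OUTPUT = {
    "s2st": True,
    "t2st": True,
    "s2tt": False,
    "t2tt": False,
    "asr": False,
}


def generate_kwargs(task: str, tgt_lang: str, src_lang: Optional[str] = None) -> dict:
    """Return the keyword arguments to pass to model.generate() for a task."""
    task = task.lower()
    if task not in SPEECH_OUTPUT:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SPEECH_OUTPUT)}")
    if task == "asr" and src_lang is not None and src_lang != tgt_lang:
        # ASR is the speech-to-text path with the target equal to the spoken language.
        raise ValueError("ASR keeps the source language: tgt_lang must equal src_lang")
    return {"tgt_lang": tgt_lang, "generate_speech": SPEECH_OUTPUT[task]}


print(generate_kwargs("s2tt", "fra"))
print(generate_kwargs("s2st", "spa"))
```

The returned dict is meant to be splatted into the call, for example model.generate(**inputs, **generate_kwargs("s2tt", "fra")).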

Speech-to-text (S2TT + ASR)

Multilingual ASR · 101 input languages

Transcribe or translate spoken audio into text. ASR mode keeps the source language; S2TT translates the speech into text in a different target language in one pass, with no separate ASR + MT cascade. At 101 speech-input languages, the input coverage is among the widest of current open speech translation models.

facebook/seamless-m4t-v2-large · task=asr or s2tt
2.3B params · 101 speech-input languages · pass tgt_lang same as source for ASR, different for S2TT · no native diarization / word timestamps

Speech-to-speech (S2ST)

End-to-end speech translation

Translate input audio into spoken audio in another language with a single model call, rather than chaining separate ASR, MT, and TTS systems. Output covers 35 speech-output languages, fewer than the input set but still one of the larger open S2ST footprints.

facebook/seamless-m4t-v2-large · task=s2st
101 speech-in / 35 speech-out · waveform output via integrated vocoder · generate_speech=True on the HF API
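The waveform returned by the speech-output tasks still has to be written to disk. A minimal sketch using only the standard library follows; save_wav is a hypothetical helper, and the 16 kHz default is an assumption here (in practice, read the rate from model.config.sampling_rate instead of hard-coding it).

```python
import array
import math
import os
import sys
import tempfile
import wave


def save_wav(samples, path, sample_rate=16_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Returns the number of frames written.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for a model waveform: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
out_path = os.path.join(tempfile.mkdtemp(), "translated.wav")
print(save_wav(tone, out_path), "frames written to", out_path)
```

A NumPy array such as the squeezed generate() output can be passed in directly, since the helper only iterates over the samples and converts each to float.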

Text-to-speech (T2ST)

Translated speech synthesis

Take written input in one language and produce spoken audio in another. Useful when you already have a text source — captions, articles, prompts — and want narrated output in a target language without a separate TTS engine.

facebook/seamless-m4t-v2-large · task=t2st
96 text-in / 35 speech-out · same vocoder path as S2ST · single-speaker output (no voice cloning here — use SeamlessExpressive for prosody / voice)

Text-to-text (T2TT)

Multilingual machine translation

Plain text in, plain text out — the conventional MT mode. Competitive against NLLB and other open MT baselines on the same language pairs, and useful as a sanity-check head when debugging S2TT quality.

facebook/seamless-m4t-v2-large · task=t2tt
96 text-in / 96 text-out · generate_speech=False · same checkpoint as the speech tasks

Expressive (voice + prosody)

Companion model · SeamlessExpressive

A separate checkpoint that preserves the speaker's voice, rhythm, and emotion when translating speech-to-speech. Smaller language coverage than M4T, and the license is the bespoke Seamless License rather than CC-BY-NC-4.0 — review terms before any deployment.

facebook/seamless-expressive · separate checkpoint, separate license
Voice + emotion preservation · narrower language matrix than v2-large · gated download on HF · Seamless License (non-commercial-leaning, read full text)

License warning · read before any production use: the SeamlessM4T and SeamlessStreaming checkpoints are released under CC-BY-NC-4.0 — the NC clause forbids commercial use. SeamlessExpressive uses the separate Seamless License. Only the surrounding code and the W2v-BERT 2.0 encoder are MIT. If you need commercial multilingual ASR, switch to openai/whisper (MIT) or faster-whisper (MIT). For managed multilingual transcription without the licensing question, Whipscribe handles it for you.
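One way to keep the licensing split from being overlooked in a pipeline is a small gate in code. The sketch below only transcribes the licenses named in the warning above; the dictionary keys are illustrative labels, commercial_use_allowed is a hypothetical helper, and the gate is no substitute for reading the license text.

```python
# Licenses as stated in the warning above; keys are illustrative labels.
LICENSES = {
    "SeamlessM4T": "CC-BY-NC-4.0",
    "SeamlessStreaming": "CC-BY-NC-4.0",
    "SeamlessExpressive": "Seamless License",
    "openai/whisper": "MIT",
    "faster-whisper": "MIT",
}

# Only licenses this page describes as permissive are treated as commercial-safe.
COMMERCIAL_OK = {"MIT"}


def commercial_use_allowed(model: str) -> bool:
    """Return True only if the model's listed license is in COMMERCIAL_OK."""
    if model not in LICENSES:
        raise KeyError(f"no license recorded for {model!r}")
    return LICENSES[model] in COMMERCIAL_OK


for name in LICENSES:
    print(f"{name}: {LICENSES[name]} -> commercial use allowed: {commercial_use_allowed(name)}")
```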

Setup recipes · pick one and copy

Three runnable paths. Recipe 1 is the easiest: HF transformers loads the v2 checkpoint with no extra build. Recipe 2 is the official fairseq2 source path from Meta's seamless_communication repo. Recipe 3 skips installation entirely and runs the model on Meta's hosted HF Space.

1. HF transformers (easiest)

pip install transformers torch torchaudio · SeamlessM4Tv2Model + AutoProcessor. Same code handles every task — switch by changing tgt_lang and generate_speech.

# SeamlessM4T v2 via Hugging Face transformers
pip install --upgrade transformers torch torchaudio

# infer.py
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

model_id  = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model     = SeamlessM4Tv2Model.from_pretrained(model_id)

# ── Speech-to-text translation (S2TT) — English audio -> French text ──
audio, sr = torchaudio.load("meeting.wav")
audio = torchaudio.functional.resample(audio, sr, 16_000)
inputs = processor(audios=audio, return_tensors="pt")
text = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(text[0].tolist()[0], skip_special_tokens=True))

# ── ASR (transcribe in source language; tgt_lang must match the spoken language) ──
asr = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(asr[0].tolist()[0], skip_special_tokens=True))

# ── Speech-to-speech translation (S2ST): English audio -> Russian speech ──
# generate_speech defaults to True; [0] is the waveform at model.config.sampling_rate
wav = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

Source: facebook/seamless-m4t-v2-large model card ↗. Transformers docs: huggingface.co/docs/transformers/model_doc/seamless_m4t_v2 ↗
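The recipe above only feeds audio in. For text input (T2TT), the same processor and model can be used with text= and src_lang= instead of audios=. The sketch below follows the pattern on the model card; translate_text is a hypothetical wrapper name, and it reloads the model on every call, which is only acceptable for a quick experiment.

```python
def translate_text(text, src_lang, tgt_lang, model_id="facebook/seamless-m4t-v2-large"):
    """T2TT: translate written text and return the translated string."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(model_id)
    model = SeamlessM4Tv2Model.from_pretrained(model_id)

    # For text input the tokenizer takes src_lang to tag the source language.
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(tokens[0].tolist()[0], skip_special_tokens=True)


# Example call (downloads the multi-gigabyte checkpoint on first run):
# print(translate_text("Hello, my dog is cute", "eng", "fra"))
```

Dropping generate_speech=False from the generate() call turns the same text input into T2ST, in which case the first element of the result is a waveform rather than token ids.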

2. fairseq2 + checkpoint (official path)

The path Meta uses in the paper — clone the repo, install in editable mode, run the m4t_predict CLI. Heavier setup, but you get the full Seamless toolchain (expressive, streaming, eval scripts).

# fairseq2 source install · official Meta CLI
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication

# fairseq2 wheel is published per torch / cuda; check the README for your combo
pip install -e .

# ── Speech-to-text translation: English wav -> French text ──
m4t_predict input.wav \
    --task s2tt \
    --src_lang eng \
    --tgt_lang fra \
    --model_name seamlessM4T_v2_large

# ── Speech-to-speech: English wav -> Spanish wav ──
m4t_predict input.wav \
    --task s2st \
    --src_lang eng \
    --tgt_lang spa \
    --model_name seamlessM4T_v2_large \
    --output_path translated.wav

# ── Text-to-text: French text -> English text ──
m4t_predict "Bonjour, comment ça va ?" \
    --task t2tt \
    --src_lang fra \
    --tgt_lang eng \
    --model_name seamlessM4T_v2_large

Source: facebookresearch/seamless_communication · Quick start ↗. fairseq2 install matrix: facebookresearch/fairseq2 ↗
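When the CLI is driven from a script, building the argument list in one place helps avoid quoting mistakes. The sketch below uses only the flags shown in the commands above; m4t_predict_argv is a hypothetical helper, and the rule that speech-output tasks need --output_path is an assumption of this sketch rather than documented CLI behaviour.

```python
import shlex

VALID_TASKS = {"s2st", "s2tt", "t2st", "t2tt", "asr"}
SPEECH_OUTPUT_TASKS = {"s2st", "t2st"}


def m4t_predict_argv(input_, task, tgt_lang, src_lang=None, output_path=None,
                     model_name="seamlessM4T_v2_large"):
    """Build the argv list for one m4t_predict invocation."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}")
    if task in SPEECH_OUTPUT_TASKS and output_path is None:
        # Assumption: a task that produces audio needs somewhere to write it.
        raise ValueError(f"task {task!r} produces audio; pass output_path")
    argv = ["m4t_predict", input_, "--task", task, "--tgt_lang", tgt_lang]
    if src_lang is not None:
        argv += ["--src_lang", src_lang]
    argv += ["--model_name", model_name]
    if output_path is not None:
        argv += ["--output_path", output_path]
    return argv


argv = m4t_predict_argv("input.wav", "s2st", "spa", src_lang="eng",
                        output_path="translated.wav")
print(shlex.join(argv))  # shell-quoted form, for logging
# To execute: subprocess.run(argv, check=True)
```

Passing the list straight to subprocess.run avoids a shell entirely, so text inputs containing quotes or spaces need no escaping.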

3. Demo via HF Spaces (no install)

Meta hosts an official Gradio Space — paste a sentence or upload a short clip and inspect the output before committing to a local install. Good for license-conscious evaluation since no weights are downloaded.

# No code path — open in a browser:
#
#   https://huggingface.co/spaces/facebook/seamless_m4t
#
# The Space runs SeamlessM4T v2 with a UI for:
#   - S2ST  (mic / file -> translated audio)
#   - S2TT  (mic / file -> translated text)
#   - T2ST  (text -> translated audio)
#   - T2TT  (text -> translated text)
#
# Tip: the Space surface is rate-limited and occasionally offline
# while Meta rebuilds dependencies. If the demo errors, the next-best
# zero-install option is the companion notebook on the repo:
#   https://github.com/facebookresearch/seamless_communication/tree/main/demo

Demo: huggingface.co/spaces/facebook/seamless_m4t ↗. Streaming demo: facebook/seamless-streaming ↗

Features

Speaker diarization	No
Word-level timestamps	No
Streaming / real-time	Yes (via the companion SeamlessStreaming model)
Languages supported	101 speech input · 96 text · 35 speech output
HIPAA eligible	No

SeamlessM4T vs Whipscribe

Feature	SeamlessM4T	Whipscribe
Category	Open source (non-commercial license)	Transcription APIs
Pricing	Free	Free beta
Speaker diarization	No	Yes
Word timestamps	No	Yes
Streaming	Yes (via SeamlessStreaming)	No
Languages	101 (speech input)	99
Platforms	Linux, GPU	Web, API, MCP

Alternatives to SeamlessM4T

OpenAI Whisper

OpenAI

The reference open-source multilingual ASR model from OpenAI.

OSS · MIT ★ 98.1k

whisper.cpp

Georgi Gerganov

C/C++ port of Whisper — runs on anything, from a Raspberry Pi to Apple Silicon.

OSS · MIT ★ 48.8k

faster-whisper

SYSTRAN

Up to 4× faster than reference Whisper using CTranslate2; a common choice for production.

OSS · MIT ★ 22.3k

Whipscribe is a managed faster-whisper + whisperX service. If you want transcripts without running infrastructure, paste a URL or drop a file in the form below — you'll have a transcript in seconds.