Volcengine Speech (ByteDance)
ByteDance's Volcengine speech-to-text platform powering Douyin/CapCut workflows.
Best for: short-form video pipelines (Douyin/TikTok-style) that need fast Mandarin captioning plus voice intelligence. Pricing: tiered, billed per second (RMB).
What it is
Volcengine is ByteDance's enterprise cloud platform. Its speech suite covers real-time streaming ASR, batch long-audio recognition, voice cloning, and TTS. ByteDance uses these services internally for Douyin (TikTok's Chinese counterpart), CapCut captioning, and Lark's meeting transcription. The Mandarin model is among the strongest in the China market. English and several other languages are available on specific endpoints. International access goes through the Volcengine International console, which has separate pricing. Directory tags: commercial-api, regional-asia. Last vendor-page check: 2026-05-12.
Watch out for: China-first; international tenants must use the Volcengine International console; some endpoints are gated to enterprise contracts.
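Streaming recognition typically sends audio as a sequence of small, fixed-duration packets instead of one complete file. The sketch below covers only that framing step. It assumes 16 kHz, 16-bit mono PCM and 100 ms packets. These values and the pcm_frames helper are illustrative assumptions, not Volcengine's documented streaming protocol. Check the endpoint's required audio format and packet size before relying on them.

```python
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16_000, sample_width: int = 2,
               channels: int = 1, frame_ms: int = 100) -> Iterator[tuple[bytes, bool]]:
    """Yield (chunk, is_last) pairs, each chunk covering frame_ms of raw PCM audio.

    Defaults (16 kHz, 16-bit, mono, 100 ms) are assumptions for illustration,
    not values taken from Volcengine's documentation.
    """
    # Bytes per packet = samples/sec * bytes/sample * channels * seconds.
    bytes_per_frame = sample_rate * sample_width * channels * frame_ms // 1000
    if bytes_per_frame <= 0:
        raise ValueError("frame duration too small for this audio format")
    for start in range(0, len(pcm), bytes_per_frame):
        chunk = pcm[start:start + bytes_per_frame]
        # Flag the final packet so the caller can signal end-of-stream.
        yield chunk, start + bytes_per_frame >= len(pcm)
```

One second of audio in this format is 32,000 bytes, so it becomes ten 3,200-byte packets. The is_last flag on the final packet lets the caller tell the server that the stream has ended.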
Install / use
Volcengine SDK: AsrClient.submit_task(...)
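The listing names only AsrClient.submit_task(...), which suggests an asynchronous batch pattern: submit a job, then poll until it finishes. The sketch below implements that submit-then-poll loop against a hypothetical client interface. The method names submit and query, the status strings "done" and "failed", and the shape of the result dictionary are all assumptions. None of them is the Volcengine SDK's real API.

```python
import time
from typing import Callable, Protocol


class BatchAsrClient(Protocol):
    """Hypothetical interface; NOT the Volcengine SDK's actual signatures."""

    def submit(self, audio_url: str) -> str: ...          # returns a task id
    def query(self, task_id: str) -> dict: ...            # returns {"status": ..., ...}


def transcribe_batch(client: BatchAsrClient, audio_url: str,
                     poll_interval: float = 2.0, timeout: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Submit a batch job and poll until it completes, fails, or times out."""
    task_id = client.submit(audio_url)
    deadline = time.monotonic() + timeout
    while True:
        result = client.query(task_id)
        status = result.get("status")  # assumed status field and values
        if status == "done":
            return result
        if status == "failed":
            raise RuntimeError(f"task {task_id} failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        sleep(poll_interval)
```

Passing sleep in as a parameter keeps the loop testable with a fake client. A production version might add exponential backoff between polls.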
Features
| Feature | Supported |
|---|---|
| Speaker diarization | Yes |
| Word-level timestamps | Yes |
| Streaming / real-time | Yes |
| Languages supported | Mandarin, English, and others (varies by endpoint) |
| HIPAA eligible | No |
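For the captioning use case, word-level timestamps and speaker diarization together provide enough information to build subtitle cues. The sketch below groups words into SRT cues. It starts a new cue when the speaker changes, when the pause between words exceeds a threshold, or when the cue text would grow too long. The Word record (text, start, end, speaker), the "S1"-style speaker labels, and the thresholds are hypothetical; the vendor's actual response fields will differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record; map the vendor's real response fields onto it."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: str  # diarization label, e.g. "S1" (assumed format)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(words: list[Word], max_chars: int = 16, max_gap: float = 0.6,
           joiner: str = "") -> str:
    """Group words into SRT cues, splitting on speaker change, long pauses,
    or when the cue text would exceed max_chars."""
    cues: list[list[Word]] = []
    for w in words:
        if cues:
            cur = cues[-1]
            text_len = len(joiner.join(x.text for x in cur + [w]))
            if (w.speaker == cur[-1].speaker
                    and w.start - cur[-1].end <= max_gap
                    and text_len <= max_chars):
                cur.append(w)
                continue
        cues.append([w])
    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = joiner.join(x.text for x in cue)
        blocks.append(f"{i}\n{srt_timestamp(cue[0].start)} --> "
                      f"{srt_timestamp(cue[-1].end)}\n[{cue[0].speaker}] {text}\n")
    # A blank line separates consecutive SRT cues.
    return "\n".join(blocks)
```

The default joiner="" suits Mandarin, which is written without spaces between words. Pass joiner=" " for English.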
Volcengine Speech (ByteDance) vs Whipscribe
| Feature | Volcengine Speech (ByteDance) | Whipscribe |
|---|---|---|
| Category | Transcription APIs | Transcription APIs |
| Pricing | tiered per-second (RMB) | free beta |
| Speaker diarization | Yes | Yes |
| Word timestamps | Yes | Yes |
| Streaming | Yes | No |
| Languages | Mandarin, English, and others (varies by endpoint) | 99 |
| Platforms | API | Web, API, MCP |
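The table gives Volcengine pricing only as "tiered per-second (RMB)". One common reading of tiered billing is graduated: each second is charged at the rate of the tier it falls into. The sketch below estimates cost under that assumption. The tier boundaries and RMB rates are placeholders, not Volcengine's published prices. If the vendor instead applies a single rate chosen by total volume, the calculation would differ.

```python
import math

# Placeholder tiers: (upper bound in cumulative seconds, RMB per second).
# NOT Volcengine's actual rates; replace with figures from the vendor's pricing page.
HYPOTHETICAL_TIERS = [
    (3_600, 0.010),     # first hour of audio
    (36_000, 0.008),    # up to ten hours
    (math.inf, 0.006),  # everything beyond ten hours
]


def graduated_cost(seconds: float, tiers=HYPOTHETICAL_TIERS) -> float:
    """Charge each second at the rate of the tier it falls into (graduated pricing)."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    total = 0.0
    lower = 0.0
    for upper, rate in tiers:
        if seconds <= lower:
            break
        # Seconds that fall inside this tier's band.
        billable = min(seconds, upper) - lower
        total += billable * rate
        lower = upper
    return total
```

With these placeholder tiers, two hours of audio costs 3,600 × 0.010 + 3,600 × 0.008 = 64.8 RMB.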
Alternatives to Volcengine Speech (ByteDance)
Whipscribe is a managed faster-whisper + whisperX service for teams that want transcripts without running their own infrastructure. You submit a URL or an audio file and get a transcript back in seconds.