Nature. 2025 Jan;637(8046):587-593. doi: 10.1038/s41586-024-08359-z. Epub 2025 Jan 15.
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems exist, they are composed of multiple subsystems that perform translation in a cascaded fashion, and scalable, high-performing unified systems remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports speech-to-speech translation (from 101 to 36 languages), speech-to-text translation (from 101 to 96 languages), text-to-speech translation (from 96 to 36 languages), text-to-text translation (96 languages) and automatic speech recognition (96 languages). Built using a new multimodal corpus of automatically aligned speech translations and other publicly available data, SEAMLESSM4T is one of the first multilingual systems that can translate from and into English for both speech and text. Moreover, it outperforms the existing state-of-the-art cascaded systems, achieving up to 8% and 23% higher BLEU (Bilingual Evaluation Understudy) scores in speech-to-text and speech-to-speech tasks, respectively. Beyond quality, when tested for robustness, our system is, on average, approximately 50% more resilient against background noise and speaker variations in speech-to-text tasks than the previous state-of-the-art systems. To assess translation safety, we evaluated SEAMLESSM4T on added toxicity and gender bias. For added toxicity, we included two mitigation strategies that operate at either training or inference time. Finally, all contributions in this work are publicly available for non-commercial use to propel further research on inclusive speech translation technologies.
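Because the quality claims above are stated in terms of BLEU, a minimal sketch of how such a score can be computed may help in reading them. The code below is an illustration only, not the evaluation pipeline used for SEAMLESSM4T: it implements an unsmoothed, sentence-level BLEU with whitespace tokenization and a single reference, whereas published evaluations typically rely on corpus-level tools with standardized tokenization. The function names (`ngram_counts`, `bleu`, `relative_gain_percent`) are hypothetical, and the last helper assumes that a "higher BLEU" percentage is read as a relative gain, which the abstract does not specify.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty: hypotheses shorter than the reference are penalized.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(log_precision_sum / max_n)


def relative_gain_percent(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent (assumed reading)."""
    return 100.0 * (new_score - old_score) / old_score
```

Under the relative reading, for example, `relative_gain_percent(27.0, 25.0)` gives 8.0; these two scores are made-up inputs chosen for the arithmetic, not results from the paper.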