Scaling neural machine translation to 200 languages.

Publication Information

Nature. 2024 Jun;630(8018):841-846. doi: 10.1038/s41586-024-07335-x. Epub 2024 Jun 5.

Abstract

The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world. Focusing on improving the translation qualities of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind, a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture, which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. We evaluated the performance of our model over 40,000 translation directions using tools created specifically for this purpose: an automatic benchmark (FLORES-200), a human evaluation metric (XSTS) and a toxicity detector that covers every language in our model. Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system.
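The conditional computation mentioned in the abstract relies on a Sparsely Gated Mixture of Experts layer, in which a learned gate routes each token to only a few of the available expert feed-forward networks, so model capacity can grow with the number of experts while per-token compute stays roughly flat. The sketch below is a minimal, illustrative PyTorch version of top-k routing; the layer sizes, number of experts and routing details are placeholder assumptions, not the configuration used in the NLLB-200 model.

```python
# Minimal sketch of a sparsely gated Mixture-of-Experts feed-forward layer
# (illustrative only; not the NLLB-200 implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary position-wise feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # The gate scores every token against every expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        scores = self.gate(x)                                  # (B, T, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # route each token to k experts
        weights = F.softmax(top_scores, dim=-1)                # normalise over the chosen experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[..., k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In a Transformer encoder-decoder, layers of this kind replace some of the dense feed-forward blocks; the paper pairs such conditional computation with regularization and training adjustments so that low-resource directions do not overfit.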

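The released artifacts (model checkpoints and the FLORES-200 benchmark) make it possible to reproduce a small-scale version of the evaluation loop described in the abstract. Below is a hedged usage sketch that translates one English sentence into French with a publicly released NLLB-200 checkpoint via the Hugging Face transformers library and scores it with sacrebleu's BLEU implementation; the checkpoint name, FLORES-200 language codes and reference sentence are illustrative assumptions, not the paper's own evaluation pipeline.

```python
# A minimal usage sketch, assuming the publicly released checkpoint
# "facebook/nllb-200-distilled-600M" on the Hugging Face Hub and recent
# versions of the transformers and sacrebleu libraries.
import sacrebleu
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

source = ["Scaling neural machine translation to 200 languages."]
# Illustrative reference translation, used here only to show the scoring step.
reference = ["Mise à l'échelle de la traduction automatique neuronale vers 200 langues."]

inputs = tokenizer(source, return_tensors="pt", padding=True)
generated = model.generate(
    **inputs,
    # FLORES-200 language codes ("fra_Latn" = French, Latin script) select the
    # target language by forcing the first generated token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
hypothesis = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Corpus-level BLEU, the quality metric quoted in the abstract.
print(sacrebleu.corpus_bleu(hypothesis, [reference]).score)
```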

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4df2/11208141/df20fc92ef95/41586_2024_7335_Fig1_HTML.jpg
