Efficient incremental training using a novel NMT-SMT hybrid framework for translation of low-resource languages.

Author Information

Bhuvaneswari Kumar, Varalakshmi Murugesan

Affiliations

School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore, Tamil Nadu, India.

School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India.

Publication Information

Front Artif Intell. 2024 Sep 25;7:1381290. doi: 10.3389/frai.2024.1381290. eCollection 2024.

DOI: 10.3389/frai.2024.1381290
PMID: 39386916
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11461459/
Abstract

The data-hungry statistical machine translation (SMT) and neural machine translation (NMT) models offer state-of-the-art results for languages with abundant data resources. However, extensive research is imperative to make these models perform equally well for low-resource languages. This paper proposes a novel approach to integrate the best features of the NMT and SMT systems for improved translation performance of low-resource English-Tamil language pair. The suboptimal NMT model trained with the small parallel corpus translates the monolingual corpus and selects only the best translations, to retrain itself in the next iteration. The proposed method employs the SMT phrase-pair table to determine the best translations, based on the maximum match between the words of the phrase-pair dictionary and each of the individual translations. This repeating cycle of translation and retraining generates a large quasi-parallel corpus, thus making the NMT model more powerful. SMT-integrated incremental training demonstrates a substantial difference in translation performance as compared to the existing approaches for incremental training. The model is strengthened further by adopting a beam search decoding strategy to produce best possible translations for each input sentence. Empirical findings prove that the proposed model with BLEU scores of 19.56 and 23.49 outperforms the baseline NMT with scores 11.06 and 17.06 for Eng-to-Tam and Tam-to-Eng translations, respectively. METEOR score evaluation further corroborates these results, proving the supremacy of the proposed model.
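
The core of the method described above is an iterative self-training loop: a small NMT model translates a monolingual corpus, the candidate translations are scored against the SMT phrase-pair table, and only the best-matching ones are added back to the training data before the model is retrained. The sketch below is a minimal illustration of that selection step only, not the authors' implementation; the phrase-table format, the overlap measure (fraction of candidate words found in the phrase-pair dictionary), and the acceptance threshold are assumptions made for the example.

```python
# Illustrative sketch (not the authors' code): pick, from a set of beam-search
# candidates, the translation that best matches an SMT phrase-pair table.
from typing import Iterable, List, Optional, Set, Tuple


def target_vocabulary(phrase_table: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect the target-side words that occur in the phrase-pair table."""
    vocab: Set[str] = set()
    for _src_phrase, tgt_phrase in phrase_table:
        vocab.update(tgt_phrase.lower().split())
    return vocab


def match_score(candidate: str, vocab: Set[str]) -> float:
    """Fraction of candidate words found in the phrase-pair dictionary.

    One plausible reading of the paper's "maximum match" criterion; the
    authors may define the overlap differently.
    """
    words = candidate.lower().split()
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)


def select_best(candidates: List[str], vocab: Set[str],
                threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring candidate, or None if even the best one
    falls below the acceptance threshold (i.e. the pair is discarded)."""
    best = max(candidates, key=lambda c: match_score(c, vocab))
    return best if match_score(best, vocab) >= threshold else None


if __name__ == "__main__":
    # Toy phrase table and beam-search candidates, entirely made up.
    phrase_table = [("good morning", "kaalai vanakkam"), ("thank you", "nandri")]
    vocab = target_vocabulary(phrase_table)
    candidates = ["kaalai vanakkam nandri", "vanakkam xyz abc"]
    print(select_best(candidates, vocab))  # -> "kaalai vanakkam nandri"
```

In the paper, source sentences whose best candidate passes this kind of filter are appended to the quasi-parallel corpus, and the NMT model is retrained on the enlarged data in each iteration; the retraining and beam-search decoding themselves depend on the underlying NMT toolkit and are not shown here.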

Figures (full-size images at PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/ee526c02df36/frai-07-1381290-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/97d3fe4a44b0/frai-07-1381290-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/f37b7228aeb4/frai-07-1381290-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/301429c22ddd/frai-07-1381290-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/b9bcd47f7066/frai-07-1381290-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/f04138b56bc0/frai-07-1381290-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/bc97475c0daa/frai-07-1381290-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea04/11461459/4206349ea270/frai-07-1381290-g008.jpg

Similar Articles

1
Efficient incremental training using a novel NMT-SMT hybrid framework for translation of low-resource languages.
Front Artif Intell. 2024 Sep 25;7:1381290. doi: 10.3389/frai.2024.1381290. eCollection 2024.
2
Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation.
Comput Intell Neurosci. 2021 Apr 11;2021:6682385. doi: 10.1155/2021/6682385. eCollection 2021.
3
Improving neural machine translation with POS-tag features for low-resource language pairs.
Heliyon. 2022 Aug 22;8(8):e10375. doi: 10.1016/j.heliyon.2022.e10375. eCollection 2022 Aug.
4
The neural machine translation models for the low-resource Kazakh-English language pair.
PeerJ Comput Sci. 2023 Feb 8;9:e1224. doi: 10.7717/peerj-cs.1224. eCollection 2023.
5
Scaling neural machine translation to 200 languages.
Nature. 2024 Jun;630(8018):841-846. doi: 10.1038/s41586-024-07335-x. Epub 2024 Jun 5.
6
English-Chinese Machine Translation Based on Transfer Learning and Chinese-English Corpus.
Comput Intell Neurosci. 2022 Sep 27;2022:1563731. doi: 10.1155/2022/1563731. eCollection 2022.
7
Domain adaptation of statistical machine translation with domain-focused web crawling.
Lang Resour Eval. 2015;49(1):147-193. doi: 10.1007/s10579-014-9282-3.
8
A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking.
Data Brief. 2018 Dec 4;22:237-240. doi: 10.1016/j.dib.2018.11.146. eCollection 2019 Feb.
9
Neural machine translation of clinical texts between long distance languages.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1478-1487. doi: 10.1093/jamia/ocz110.
10
MetaMT, a Meta Learning Method Leveraging Multiple Domain Data for Low Resource Machine Translation.
Proc AAAI Conf Artif Intell. 2020;34(5):8245-8252. doi: 10.1609/aaai.v34i05.6339. Epub 2020 Apr 3.