Machine Translation Utilizing the Frequent-Item Set Concept.

Affiliations

Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh P.O. Box 11671, Saudi Arabia.

Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh P.O. Box 11671, Saudi Arabia.

Publication information

Sensors (Basel). 2021 Feb 21;21(4):1493. doi: 10.3390/s21041493.

DOI:10.3390/s21041493
PMID:33670035
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7926351/
Abstract

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules joining phrases present in the source language, and phrases present in the target language. It has to be noted that a sequential scan of the corpus for such phrases will increase the response time in an unexpected manner. We introduce the pre-processing of the bilingual corpus through proposing a data structure called Corpus-Trie (CT) that renders a bilingual parallel corpus in a compact data structure representing frequent data items sets. We also present algorithms which utilize the CT to respond to translation requests and explore novel techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any specific language. Moreover, the proposed Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic to the count of unrepeated phrases in the original bilingual corpus (and thus, the Corpus-Trie size). In practical situations, 5-20% of the log of the number of the nodes have to be visited. The experimental results indicate that the BLEU score for the proposed CT system increases with the size of the number of phrases in the CT, for both English-Arabic and English-French translations. The proposed CT system was demonstrated to be better than both Omega-T and Apertium in quality of translation from a corpus size exceeding 1,600,000 phrases for English-Arabic translation, and 300,000 phrases for English-French translation.
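The idea of pre-processing a parallel corpus into a trie and answering translation requests from it can be illustrated with a minimal sketch. The abstract does not specify the internal layout of the Corpus-Trie, so everything below is an assumption made for illustration: the names `TrieNode`, `CorpusTrie`, `add_pair`, `lookup`, and `translate` are hypothetical, the trie is keyed on source-language words, and the most frequent target phrase stands in for the paper's association rules. It is not the authors' algorithm and does not reproduce their timing or BLEU results.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by the next source-language word.
    children: dict = field(default_factory=dict)
    # Target-language phrases observed for the source phrase ending here, with counts.
    targets: Counter = field(default_factory=Counter)


class CorpusTrie:
    """Hypothetical word-level trie built from aligned phrase pairs."""

    def __init__(self):
        self.root = TrieNode()

    def add_pair(self, source_phrase: str, target_phrase: str) -> None:
        """Insert one aligned phrase pair; repeated pairs raise the count."""
        node = self.root
        for word in source_phrase.lower().split():
            node = node.children.setdefault(word, TrieNode())
        node.targets[target_phrase] += 1

    def lookup(self, source_phrase: str):
        """Return the most frequent target phrase for an exact match, else None."""
        node = self.root
        for word in source_phrase.lower().split():
            node = node.children.get(word)
            if node is None:
                return None
        if not node.targets:
            return None
        return node.targets.most_common(1)[0][0]

    def translate(self, sentence: str) -> list:
        """Greedy longest-match segmentation; unknown words pass through unchanged."""
        words = sentence.lower().split()
        output, i = [], 0
        while i < len(words):
            node, best, best_end = self.root, None, i
            for j in range(i, len(words)):
                node = node.children.get(words[j])
                if node is None:
                    break
                if node.targets:
                    # Remember the longest stored phrase that starts at position i.
                    best = node.targets.most_common(1)[0][0]
                    best_end = j + 1
            if best is None:
                output.append(words[i])
                i += 1
            else:
                output.append(best)
                i = best_end
        return output


# Toy English-French pairs, invented for this example only.
ct = CorpusTrie()
ct.add_pair("machine translation", "traduction automatique")
ct.add_pair("machine", "machine")
ct.add_pair("is useful", "est utile")
print(ct.translate("Machine translation is useful today"))
```

In this sketch, the work done per lookup depends on the length of the request rather than on a sequential scan of the corpus, which is the motivation the abstract gives for pre-processing. The logarithmic response time reported by the authors refers to their own structure and experiments, not to this simplified version.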

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/23cc5cab1d1b/sensors-21-01493-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/a5ebb17f7bd9/sensors-21-01493-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/624c1d208403/sensors-21-01493-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/b39eee306beb/sensors-21-01493-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/4f5fae067044/sensors-21-01493-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/b71a1bca838e/sensors-21-01493-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/5c3d0793616e/sensors-21-01493-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/1f17387e25d4/sensors-21-01493-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/197632dfc01d/sensors-21-01493-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/9e72a2ebf6b0/sensors-21-01493-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/5d08a6f18054/sensors-21-01493-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b8e/7926351/857d3acf1608/sensors-21-01493-g012.jpg

Similar articles

1
Machine Translation Utilizing the Frequent-Item Set Concept.
Sensors (Basel). 2021 Feb 21;21(4):1493. doi: 10.3390/s21041493.
2
English-Chinese Machine Translation Based on Transfer Learning and Chinese-English Corpus.
Comput Intell Neurosci. 2022 Sep 27;2022:1563731. doi: 10.1155/2022/1563731. eCollection 2022.
3
Machine Translation System Using Deep Learning for English to Urdu.
Comput Intell Neurosci. 2022 Jan 3;2022:7873012. doi: 10.1155/2022/7873012. eCollection 2022.
4
Neural machine translation of clinical texts between long distance languages.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1478-1487. doi: 10.1093/jamia/ocz110.
5
Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation.
Comput Intell Neurosci. 2021 Apr 11;2021:6682385. doi: 10.1155/2021/6682385. eCollection 2021.
6
Adaptation of machine translation for multilingual information retrieval in the medical domain.
Artif Intell Med. 2014 Jul;61(3):165-85. doi: 10.1016/j.artmed.2014.01.004. Epub 2014 Feb 5.
7
Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives.
J Biomed Inform. 2014 Apr;48:54-65. doi: 10.1016/j.jbi.2013.11.008. Epub 2013 Dec 4.
8
Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools.
J Am Med Inform Assoc. 2024 Aug 1;31(8):1725-1734. doi: 10.1093/jamia/ocae159.
9
Alignment-Supervised Bidimensional Attention-Based Recursive Autoencoders for Bilingual Phrase Representation.
IEEE Trans Cybern. 2020 Feb;50(2):503-513. doi: 10.1109/TCYB.2018.2868982. Epub 2018 Sep 26.
10
Machine translation training data for English-Tshivenḓa.
Data Brief. 2024 Sep 7;57:110898. doi: 10.1016/j.dib.2024.110898. eCollection 2024 Dec.