ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application.

Affiliations

Institute of Medical Information/Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.

Office of Cancer Screening, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.

Publication information

BMC Med Inform Decis Mak. 2020 Jul 9;20(Suppl 3):122. doi: 10.1186/s12911-020-1116-1.

DOI:10.1186/s12911-020-1116-1
PMID:32646415
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7346326/
Abstract

BACKGROUND

Rising global cancer incidence has a serious health impact in countries worldwide. Knowledge-powered health systems in different languages would enhance clinicians' healthcare practice, patients' health management, and public health literacy. A high-quality corpus containing cancer information is a necessary foundation for cancer education. Massive unstructured information resources exist in clinical narratives, electronic health records (EHRs), and similar sources, but they can be used to train AI models only after being transformed into a structured corpus. However, the scarcity of multilingual cancer corpora limits intelligent processing such as machine translation in medical scenarios. We therefore created a cancer-specific cross-lingual corpus and opened it to the public for academic use.

METHODS

To build an English-Chinese cancer parallel corpus, we developed a seven-step workflow: data retrieval, data parsing, data processing, corpus implementation, assessment verification, corpus release, and application. We applied the workflow to PDQ (Physician Data Query), a cross-lingual, comprehensive, and authoritative cancer information resource. We constructed, validated, and released the parallel corpus, named ECCParaCorp, and made it openly accessible online.

RESULTS

The proposed English-Chinese Cancer Parallel Corpus (ECCParaCorp) consists of 6685 aligned text pairs in XML, Excel, and CSV formats: 5190 sentence pairs, 1083 phrase pairs, and 412 word pairs. It covers 6 cancers (breast, liver, lung, esophageal, colorectal, and stomach cancer) and 3 cancer themes (prevention, screening, and treatment). All data in the parallel corpus are available online for users to browse and download (http://www.phoc.org.cn/ECCParaCorp/).
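
As a rough illustration of how the three granularities of aligned pairs might be tallied from a CSV export, the following Python sketch parses a few made-up rows. The column names (`level`, `cancer`, `theme`, `en`, `zh`) and the sample rows are assumptions for illustration only; the abstract does not describe the actual file schema. The last lines simply check that the reported counts add up to the stated total.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column names and the text are illustrative
# assumptions, not the published ECCParaCorp schema or data.
SAMPLE_CSV = """level,cancer,theme,en,zh
sentence,breast cancer,screening,Mammography is a screening test for breast cancer.,乳房X线摄影是乳腺癌的一种筛查检查。
phrase,lung cancer,prevention,smoking cessation,戒烟
word,liver cancer,treatment,chemotherapy,化疗
sentence,stomach cancer,treatment,Surgery is a common treatment for stomach cancer.,手术是胃癌的常用治疗方法。
"""


def load_pairs(text):
    """Parse CSV text into a list of dict rows, one per aligned pair."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by(pairs, field):
    """Tally aligned pairs by the value of one column."""
    return Counter(row[field] for row in pairs)


pairs = load_pairs(SAMPLE_CSV)
level_counts = count_by(pairs, "level")
print(dict(level_counts))

# Counts reported in the abstract, by granularity of the aligned pair.
REPORTED = {"sentence": 5190, "phrase": 1083, "word": 412}
reported_total = sum(REPORTED.values())
print(reported_total)
```

The same `count_by` helper could be pointed at the hypothetical `cancer` or `theme` column to break the corpus down by cancer type or by theme.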

CONCLUSIONS

ECCParaCorp is an openly accessible cross-lingual parallel corpus focused on cancer. It would help redress the scarcity of multilingual corpus resources, bridge the gap between human-readable information and machine-understandable data, and serve as a preparatory data foundation for intelligent technology applications, e.g. cancer-related machine translation, cancer system development for medical education, and disease-oriented knowledge extraction.
