Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools.

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.

Publication information

J Am Med Inform Assoc. 2024 Aug 1;31(8):1725-1734. doi: 10.1093/jamia/ocae159.

DOI:10.1093/jamia/ocae159
PMID:38934643
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11258409/
Abstract

OBJECTIVE

To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora.

MATERIALS AND METHODS

Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English.
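The idea of embedding annotations in the text before translation can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `[[concept_id|mention]]` marker syntax, the function names, and the concept identifiers are all assumptions made for the example, and the translation step is simulated with a hand-written Dutch string instead of a call to any translation service. It also assumes the annotated spans do not overlap.

```python
import re

# Hypothetical inline marker: [[concept_id|mention text]]
MARKER = re.compile(r"\[\[([^|\]]+)\|([^\]]*)\]\]")


def embed_annotations(text, spans):
    """Wrap each (start, end, concept_id) span in an inline marker.

    Assumes spans are non-overlapping character offsets into `text`.
    """
    pieces, cursor = [], 0
    for start, end, concept_id in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"[[{concept_id}|{text[start:end]}]]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def extract_annotations(marked):
    """Strip markers and return (plain_text, spans).

    The returned spans carry offsets into the marker-free text.
    """
    pieces, spans, cursor, length = [], [], 0, 0
    for match in MARKER.finditer(marked):
        before = marked[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        concept_id, mention = match.group(1), match.group(2)
        spans.append((length, length + len(mention), concept_id))
        pieces.append(mention)
        length += len(mention)
        cursor = match.end()
    pieces.append(marked[cursor:])
    return "".join(pieces), spans


english = "Patient has chest pain and fever."
english_spans = [(12, 22, "C0008031"), (27, 32, "C0015967")]  # illustrative IDs
marked_english = embed_annotations(english, english_spans)

# Stand-in for the machine translation output; a real service may alter or
# drop markers, which is why recovery has to be checked afterwards.
marked_dutch = "Patiënt heeft [[C0008031|pijn op de borst]] en [[C0015967|koorts]]."
dutch, dutch_spans = extract_annotations(marked_dutch)
```

After translation, the recovered spans point at the Dutch mentions, so the same concept labels can be used to score a Dutch extraction tool. Comparing the number of recovered spans with the number embedded is one simple way to detect annotations lost in translation.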

RESULTS

The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision.
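To make the precision and recall comparison concrete, the sketch below scores predicted concept spans against gold annotations. It assumes exact matching on `(start, end, concept_id)` triples, which may differ from the matching criteria used in the study; the function name and the example spans are illustrative only.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact-match triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative data: the tool finds both gold concepts plus two spurious ones,
# which yields full recall but reduced precision.
gold_spans = [(14, 30, "C0008031"), (34, 40, "C0015967")]
predicted_spans = gold_spans + [(0, 7, "C0030705"), (8, 13, "C0000000")]

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
```

A tool that over-generates candidate concepts shows this pattern of high recall with lower precision, which is the behaviour the abstract reports for MedSpaCy.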

DISCUSSION

Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools.

CONCLUSION

This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c376/11258409/35c077784a5c/ocae159f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c376/11258409/337a50af9044/ocae159f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c376/11258409/fdde8622899a/ocae159f2.jpg