A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

Authors

Kors Jan A, Clematide Simon, Akhondi Saber A, van Mulligen Erik M, Rebholz-Schuhmann Dietrich

Affiliations

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.

Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland.

Publication information

J Am Med Inform Assoc. 2015 Sep;22(5):948-56. doi: 10.1093/jamia/ocv037. Epub 2015 May 6.

DOI:10.1093/jamia/ocv037
PMID:25948699
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4986661/
Abstract

OBJECTIVE

To create a multilingual gold-standard corpus for biomedical concept recognition.

MATERIALS AND METHODS

We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.
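The abstract does not say how the individual annotations were harmonized automatically. As a rough illustration only, the sketch below assumes a simple majority vote: an annotation is kept when at least two of the three annotators produced it. The `(start, end, concept_id)` representation, the `harmonize` function, the `min_votes` parameter, and the concept identifiers are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical representation: (start offset, end offset, concept identifier).
Annotation = Tuple[int, int, str]


def harmonize(annotator_sets: List[Set[Annotation]], min_votes: int = 2) -> Set[Annotation]:
    """Keep every annotation produced by at least `min_votes` annotators.

    Majority voting is an assumption made for this sketch; the abstract
    does not describe the actual harmonization procedure.
    """
    votes = Counter(ann for anns in annotator_sets for ann in anns)
    return {ann for ann, count in votes.items() if count >= min_votes}


# Three annotators for one text unit (placeholder concept identifiers).
annotator_1 = {(0, 8, "CUI_A"), (20, 28, "CUI_B")}
annotator_2 = {(0, 8, "CUI_A"), (20, 28, "CUI_B"), (35, 40, "CUI_C")}
annotator_3 = {(0, 8, "CUI_A")}

harmonized = harmonize([annotator_1, annotator_2, annotator_3])
print(sorted(harmonized))
```

In this toy case the annotation proposed by only one annotator is dropped. Under this assumed scheme, such disagreements are the cases that manual adjudication would then need to resolve.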

RESULTS

The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language.
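To make the agreement figure concrete, the following sketch computes a pairwise F-score between annotators and takes the median over all pairs. It assumes that two annotations match only when both the span and the concept identifier are identical. The abstract does not state the matching criterion, so this is an assumption, as are the function names and the toy data.

```python
from itertools import combinations
from statistics import median
from typing import List, Set, Tuple

# Hypothetical representation: (start offset, end offset, concept identifier).
Annotation = Tuple[int, int, str]


def f_score(reference: Set[Annotation], candidate: Set[Annotation]) -> float:
    """Harmonic mean of precision and recall under exact matching (assumed)."""
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def pairwise_agreement(annotator_sets: List[Set[Annotation]]) -> List[float]:
    """F-score for every unordered pair of annotators."""
    return [f_score(a, b) for a, b in combinations(annotator_sets, 2)]


# Toy data with placeholder concept identifiers.
annotators = [
    {(0, 8, "CUI_A"), (20, 28, "CUI_B")},
    {(0, 8, "CUI_A"), (20, 28, "CUI_B"), (35, 40, "CUI_C")},
    {(0, 8, "CUI_A")},
]

scores = pairwise_agreement(annotators)
print([round(s, 3) for s in scores], round(median(scores), 3))
```

Because the F-score is symmetric in precision and recall, it does not matter which annotator of a pair is treated as the reference. The same `f_score` function can also compare one annotator, or an automatic system, against the gold standard.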

DISCUSSION

The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

CONCLUSION

To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/ec20f5ffd1e2/ocv037f1p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/11aee5103794/ocv037f2p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/c129af5dc504/ocv037f3p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/c75a6ca0de25/ocv037f4p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/22a8df3f6a25/ocv037f5p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ebc/4986661/0e7ee37a1262/ocv037f6p.jpg
