
SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.

Affiliations

Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil.

AI Lab, Philips Research North America, Cambridge, MA, USA.

Publication information

J Biomed Semantics. 2022 May 8;13(1):13. doi: 10.1186/s13326-022-00269-1.

DOI: 10.1186/s13326-022-00269-1
PMID: 35527259
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9080187/
Abstract

BACKGROUND

The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

METHODS

In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

RESULTS

This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average inter-annotator agreement score ranged from 0.71 under strict matching to 0.92 under relaxed matching, which accepts partial overlaps and hierarchically related semantic types. The extrinsic evaluation, which applied the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results that were consistent with the agreement scores.
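
The difference between strict and relaxed matching can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes entities are character-offset spans with a semantic-type label, that agreement between two annotators is scored as a pairwise F1, and that "hierarchically related" types are supplied as an explicit set of label pairs. The names `Entity`, `strict_match`, `relaxed_match`, and `agreement_f1`, as well as the example spans and labels, are hypothetical.

```python
from functools import partial
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # semantic type (illustrative labels only)


def strict_match(a: Entity, b: Entity) -> bool:
    """Strict match: identical boundaries and identical semantic type."""
    return a == b


def relaxed_match(a: Entity, b: Entity, related=frozenset()) -> bool:
    """Relaxed match: spans overlap and the types are equal or listed as related."""
    overlap = a.start < b.end and b.start < a.end
    type_ok = a.label == b.label or frozenset((a.label, b.label)) in related
    return overlap and type_ok


def agreement_f1(ann_a, ann_b, match) -> float:
    """Pairwise F1, treating annotator A as reference and B as response."""
    if not ann_a and not ann_b:
        return 1.0
    if not ann_a or not ann_b:
        return 0.0
    recall = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    precision = sum(any(match(a, b) for a in ann_a) for b in ann_b) / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same note by two annotators.
ann_a = [Entity(0, 8, "Disorder"), Entity(12, 20, "Drug"),
         Entity(25, 30, "Sign or Symptom")]
ann_b = [Entity(0, 8, "Disorder"), Entity(10, 20, "Drug"),
         Entity(25, 30, "Finding")]

# Assumed hierarchy: one pair of types treated as related for this example.
related = {frozenset(("Sign or Symptom", "Finding"))}

strict_score = agreement_f1(ann_a, ann_b, strict_match)
relaxed_score = agreement_f1(ann_a, ann_b, partial(relaxed_match, related=related))
print(f"strict={strict_score:.2f} relaxed={relaxed_score:.2f}")
```

In this toy example only one of the three entity pairs agrees exactly, while all three agree once a boundary difference and a related type are tolerated, which mirrors why relaxed scores sit above strict ones.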

CONCLUSION

The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f897/9080187/e65bb0bce7f0/13326_2022_269_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f897/9080187/7eeb3f7a32bc/13326_2022_269_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f897/9080187/1c6dbb80b988/13326_2022_269_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f897/9080187/7e11d360c466/13326_2022_269_Fig4_HTML.jpg

