Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit.

Affiliations

Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.

Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.

Publication information

Artif Intell Med. 2021 Jul;117:102083. doi: 10.1016/j.artmed.2021.102083. Epub 2021 May 1.

DOI: 10.1016/j.artmed.2021.102083
PMID: 34127232

Abstract

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
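
The F1 figures quoted in the abstract combine precision and recall into their harmonic mean. The following is a minimal sketch of how such a score could be computed for concept extraction, assuming annotations are compared by exact match on (start, end, concept id). The function name `f1_score`, the offsets, and the concept identifiers are hypothetical illustrations and are not taken from the paper or from MedCAT's API; the paper's own evaluation protocol may differ.

```python
from typing import Set, Tuple

# An annotation is (start offset, end offset, concept id).
Annotation = Tuple[int, int, str]


def f1_score(gold: Set[Annotation], predicted: Set[Annotation]) -> float:
    """Exact-match F1 between gold and predicted concept annotations."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: offsets and concept ids are made up for illustration.
gold = {(0, 8, "CUI_A"), (15, 27, "CUI_B"), (40, 52, "CUI_C")}
predicted = {(0, 8, "CUI_A"), (15, 27, "CUI_X"), (40, 52, "CUI_C"), (60, 65, "CUI_D")}

score = f1_score(gold, predicted)
print(f"F1 = {score:.3f}")
```

Here two of the four predictions match the three gold annotations, so precision is 2/4 and recall is 2/3. A span with the right boundaries but the wrong concept id counts as both a false positive and a false negative under this exact-match assumption.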

Similar articles

1
Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit.
Artif Intell Med. 2021 Jul;117:102083. doi: 10.1016/j.artmed.2021.102083. Epub 2021 May 1.
2
Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives.
J Biomed Inform. 2014 Apr;48:54-65. doi: 10.1016/j.jbi.2013.11.008. Epub 2013 Dec 4.
3
Natural Language Processing to Extract Head and Neck Cancer Data From Unstructured Electronic Health Records.
Clin Oncol (R Coll Radiol). 2025 May;41:103805. doi: 10.1016/j.clon.2025.103805. Epub 2025 Mar 20.
4
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
5
Extraction of UMLS® Concepts Using Apache cTAKES™ for German Language.
Stud Health Technol Inform. 2016;223:71-6.
6
Using a statistical natural language Parser augmented with the UMLS specialist lexicon to assign SNOMED CT codes to anatomic sites and pathologic diagnoses in full text pathology reports.
AMIA Annu Symp Proc. 2009 Nov 14;2009:386-90.
7
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets.
J Am Med Inform Assoc. 2021 Mar 1;28(3):516-532. doi: 10.1093/jamia/ocaa269.
8
Unified Medical Language System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)-based ranking for concept normalization.
J Am Med Inform Assoc. 2020 Oct 1;27(10):1510-1519. doi: 10.1093/jamia/ocaa080.
9
Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes.
J Biomed Inform. 2020 Feb;102:103354. doi: 10.1016/j.jbi.2019.103354. Epub 2019 Dec 12.
10
Mining cross-terminology links in the UMLS.
AMIA Annu Symp Proc. 2006;2006:624-8.

Cited by

1
AI assisted prediction of unplanned intensive care admissions using natural language processing in elective neurosurgery.
NPJ Digit Med. 2025 Aug 27;8(1):549. doi: 10.1038/s41746-025-01952-0.
2
Exploring the consistency, quality and challenges in manual and automated coding of free-text diagnoses from hospital outpatient letters.
PLoS One. 2025 Aug 25;20(8):e0328108. doi: 10.1371/journal.pone.0328108. eCollection 2025.
3
SNOMED CT entity linking challenge.
J Am Med Inform Assoc. 2025 Sep 1;32(9):1397-1406. doi: 10.1093/jamia/ocaf104.
4
Racial and ethnic disparities in aortic stenosis within a universal healthcare system characterized by natural language processing for targeted intervention.
Eur Heart J Digit Health. 2025 Mar 18;6(3):392-403. doi: 10.1093/ehjdh/ztaf018. eCollection 2025 May.
5
Enhanced effective convolutional attention network with squeeze-and-excitation inception module for multi-label clinical document classification.
Sci Rep. 2025 May 16;15(1):16988. doi: 10.1038/s41598-025-98719-0.
6
Dual-stream algorithms for dementia detection: Harnessing structured and unstructured electronic health record data, a novel approach to prevalence estimation.
Alzheimers Dement. 2025 May;21(5):e70132. doi: 10.1002/alz.70132.
7
I-SIRch: AI-powered concept annotation tool for equitable extraction and analysis of safety insights from maternity investigations.
Int J Popul Data Sci. 2024 Nov 20;9(2):2439. doi: 10.23889/ijpds.v9i2.2439. eCollection 2024.
8
Large Language Model-Based Critical Care Big Data Deployment and Extraction: Descriptive Analysis.
JMIR Med Inform. 2025 Mar 12;13:e63216. doi: 10.2196/63216.
9
Diagnosis extraction from unstructured Dutch echocardiogram reports using span- and document-level characteristic classification.
BMC Med Inform Decis Mak. 2025 Mar 7;25(1):115. doi: 10.1186/s12911-025-02897-w.
10
Utilizing GPT-4 to interpret oral mucosal disease photographs for structured report generation.
Sci Rep. 2025 Feb 12;15(1):5187. doi: 10.1038/s41598-025-89328-y.