Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit.

Affiliations

Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.

Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.

Publication information

Artif Intell Med. 2021 Jul;117:102083. doi: 10.1016/j.artmed.2021.102083. Epub 2021 May 1.

Abstract

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
