Strategies to Address the Lack of Labeled Data for Supervised Machine Learning Training With Electronic Health Records: Case Study for the Extraction of Symptoms From Clinical Notes.

Authors

Humbert-Droz Marie, Mukherjee Pritam, Gevaert Olivier

Affiliations

Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA, United States.

Department of Biomedical Data Science, Stanford University, Stanford, CA, United States.

Publication information

JMIR Med Inform. 2022 Mar 14;10(3):e32903. doi: 10.2196/32903.

DOI:10.2196/32903
PMID:35285805
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8961340/
Abstract

BACKGROUND

Automated extraction of symptoms from clinical notes is a challenging task owing to the multidimensional nature of symptom description. Labeled training data are extremely limited because the notes contain protected health information. Natural language processing and machine learning have great potential for processing clinical text for such a task. However, supervised machine learning requires a large amount of labeled data to train a model, which is the main bottleneck in model development.

OBJECTIVE

The aim of this study is to address the lack of labeled data by proposing 2 alternatives to manual labeling for the generation of training labels for supervised machine learning with English clinical text. We aim to demonstrate that using lower-quality labels for training leads to good classification results.

METHODS

We addressed the lack of labels with 2 strategies. The first approach took advantage of the structured part of electronic health records and used diagnosis codes (International Classification of Diseases, 10th Revision [ICD-10]) to derive training labels. The second approach used weak supervision and data programming principles to derive training labels. We propose to apply the developed framework to the extraction of symptom information from outpatient visit progress notes of patients with cardiovascular diseases.
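
The two labeling strategies can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the target symptom (chest pain), the code-to-symptom mapping, the regular expressions, and the function names are all hypothetical. In particular, data programming normally combines labeling functions with a learned label model; a simple majority vote is swapped in here to keep the example self-contained.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Strategy 1: derive a note-level label from the encounter's diagnosis codes.
# Hypothetical mapping: ICD-10 category R07 (pain in throat and chest).
CHEST_PAIN_PREFIXES = ("R07",)

def label_from_icd10(encounter_codes):
    """POSITIVE if any encounter code falls under a target category."""
    hit = any(code.startswith(CHEST_PAIN_PREFIXES) for code in encounter_codes)
    return POSITIVE if hit else NEGATIVE

# Strategy 2: weak supervision with small heuristic labeling functions.
NEGATED = re.compile(r"\b(denies|no|without)\s+(any\s+)?chest pain\b", re.I)
MENTION = re.compile(r"\bchest pain\b", re.I)
SYNONYM = re.compile(r"\b(angina|chest tightness)\b", re.I)

def lf_affirmed_mention(note):
    # Votes POSITIVE only when the symptom is mentioned without a negation cue.
    if MENTION.search(note) and not NEGATED.search(note):
        return POSITIVE
    return ABSTAIN

def lf_negated_mention(note):
    return NEGATIVE if NEGATED.search(note) else ABSTAIN

def lf_synonym(note):
    return POSITIVE if SYNONYM.search(note) else ABSTAIN

LABELING_FUNCTIONS = (lf_affirmed_mention, lf_negated_mention, lf_synonym)

def weak_label(note, lfs=LABELING_FUNCTIONS):
    """Majority vote over non-abstaining functions; ties and silence abstain."""
    votes = Counter(v for v in (lf(note) for lf in lfs) if v != ABSTAIN)
    if votes[POSITIVE] == votes[NEGATIVE]:
        return ABSTAIN
    return POSITIVE if votes[POSITIVE] > votes[NEGATIVE] else NEGATIVE
```

Notes on which every function abstains would simply be left out of the training set under this sketch, whereas the code-based strategy always yields a label, correct or not.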

RESULTS

We used >500,000 notes for training our classification model with International Classification of Diseases, 10th Revision (ICD-10) codes as labels and >800,000 notes for training using labels derived from weak supervision. We show that the dependence between prevalence and recall becomes flat provided a sufficiently large training set is used (>500,000 documents). We further demonstrate that using weak labels for training rather than the electronic health record codes derived from the patient encounter leads to an overall improved recall score (10% improvement, on average). Finally, the external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in the recall score.
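
For readers less familiar with the two quantities compared above, the following Python sketch shows how recall and prevalence could be computed for a single symptom label. The function name and the toy labels are hypothetical and do not reproduce any result from the study.

```python
def recall_and_prevalence(y_true, y_pred):
    """Return (recall, prevalence) for one binary symptom label.

    recall     = true positives / all actual positives
    prevalence = actual positives / all documents
    """
    positives = sum(1 for t in y_true if t == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    recall = true_pos / positives if positives else float("nan")
    prevalence = positives / len(y_true) if y_true else float("nan")
    return recall, prevalence

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
recall, prevalence = recall_and_prevalence(y_true, y_pred)
```

Computing this pair for every symptom and plotting recall against prevalence is one way to check whether rare symptoms are recalled less well than common ones.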

CONCLUSIONS

This work demonstrates the power of using a weak labeling pipeline to annotate and extract symptom mentions in clinical text, with the prospect of facilitating the integration of symptom information into downstream clinical tasks such as clinical decision support.
