Efficient Detection of Stigmatizing Language in Electronic Health Records via In-Context Learning: Comparative Analysis and Validation Study.

Authors

Chen Hongbo, Alfred Myrtede, Cohen Eldan

Affiliation

Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, Canada.

Publication

JMIR Med Inform. 2025 Aug 18;13:e68955. doi: 10.2196/68955.

DOI:10.2196/68955
PMID:40825541
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12402740/
Abstract

BACKGROUND

The presence of stigmatizing language within electronic health records (EHRs) poses significant risks to patient care by perpetuating biases. While numerous studies have explored the use of supervised machine learning models to detect stigmatizing language automatically, these models require large, annotated datasets, which may not always be readily available. In-context learning (ICL) has emerged as a data-efficient alternative, allowing large language models to adapt to tasks using only instructions and examples.
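To make "adapting to tasks using only instructions and examples" concrete, here is a minimal sketch of how an ICL prompt might be assembled. The function name, the instruction wording, and the example sentences are all hypothetical and are not taken from the study; the abstract does not give the actual prompts.

```python
from typing import List, Tuple


def build_icl_prompt(instruction: str,
                     examples: List[Tuple[str, str]],
                     query: str) -> str:
    """Assemble a prompt: instruction, labeled examples, then the unlabeled query."""
    parts = [instruction.strip(), ""]
    for sentence, label in examples:
        parts.append(f"Sentence: {sentence}")
        parts.append(f"Label: {label}")
        parts.append("")
    # The query is left unlabeled so the model completes the label.
    parts.append(f"Sentence: {query}")
    parts.append("Label:")
    return "\n".join(parts)


# Hypothetical instruction and examples, invented for illustration only.
instruction = ("Decide whether the sentence contains stigmatizing language. "
               "Answer 'stigmatizing' or 'neutral'.")
examples = [
    ("Patient is a frequent flyer and refuses to comply.", "stigmatizing"),
    ("Patient reports chest pain since this morning.", "neutral"),
]
prompt = build_icl_prompt(instruction, examples, "Patient declined the medication.")
print(prompt)
```

Passing an empty `examples` list yields a zero-shot prompt; adding labeled pairs yields the few-shot variants.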

OBJECTIVE

We aimed to investigate the efficacy of ICL in detecting stigmatizing language within EHRs under data-scarce conditions.

METHODS

We analyzed 5043 sentences from the Medical Information Mart for Intensive Care-IV dataset, which contains EHRs from patients admitted to the emergency department at the Beth Israel Deaconess Medical Center. We compared ICL with zero-shot (textual entailment), few-shot (SetFit), and supervised fine-tuning approaches. The ICL approach used 4 prompting strategies: generic, chain of thought, clue and reasoning prompting, and a newly introduced stigma detection guided prompt. Model fairness was evaluated using the equal performance criterion, measuring true positive rate, false positive rate, and F-score disparities across protected attributes, including sex, age, and race.
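The fairness evaluation can be sketched as follows. This is an illustrative sketch only: it assumes binary labels and measures disparity as the gap between the best and worst group on each metric, which is one common reading of the equal performance criterion. The abstract does not state the exact disparity definition, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def group_rates(y_true: Sequence[int],
                y_pred: Sequence[int],
                groups: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute TPR, FPR, and F1 separately for each group of a protected attribute."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            counts[g]["tp" if p == 1 else "fn"] += 1
        else:
            counts[g]["fp" if p == 1 else "tn"] += 1

    rates = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        f1_denom = 2 * c["tp"] + c["fp"] + c["fn"]
        rates[g] = {
            "tpr": c["tp"] / pos if pos else 0.0,
            "fpr": c["fp"] / neg if neg else 0.0,
            "f1": 2 * c["tp"] / f1_denom if f1_denom else 0.0,
        }
    return rates


def disparities(rates: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Assumed disparity measure: max minus min of each metric across groups."""
    return {
        m: max(r[m] for r in rates.values()) - min(r[m] for r in rates.values())
        for m in ("tpr", "fpr", "f1")
    }


# Toy data, invented for illustration: group A is classified perfectly, group B is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A"] * 4 + ["B"] * 4

rates = group_rates(y_true, y_pred, groups)
gaps = disparities(rates)
print(gaps)
```

A gap of 0 on all three metrics would mean the model performs equally across groups; larger gaps indicate greater bias under this criterion.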

RESULTS

In the zero-shot setting, the best-performing ICL model, GEMMA-2, achieved a mean F-score of 0.858 (95% CI 0.854-0.862), showing an 18.7% improvement over the best textual entailment model, DEBERTA-M (mean F-score 0.723, 95% CI 0.718-0.728; P<.001). In the few-shot setting, the top ICL model, LLAMA-3, outperformed the leading SetFit models by 21.2%, 21.4%, and 12.3% with 4, 8, and 16 annotations per class, respectively (P<.001). Using 32 labeled instances, the best ICL model achieved a mean F-score of 0.901 (95% CI 0.895-0.907), only 3.2% lower than the best supervised fine-tuning model, ROBERTA (mean F-score 0.931, 95% CI 0.924-0.938), which was trained on 3543 labeled instances. Under the conditions tested, fairness evaluation revealed that supervised fine-tuning models exhibited greater bias compared with ICL models in the zero-shot, 4-shot, 8-shot, and 16-shot settings, as measured by true positive rate, false positive rate, and F-score disparities.
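The abstract does not say how its percentages are computed, but the reported F-scores are consistent with relative differences against the comparison model. A short check, using only the means quoted above (the helper name is hypothetical):

```python
def relative_change(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Zero-shot: GEMMA-2 (0.858) vs. DEBERTA-M (0.723).
zero_shot_gain = relative_change(0.858, 0.723)

# 32-shot ICL (0.901) vs. supervised fine-tuned ROBERTA (0.931).
supervised_gap = relative_change(0.901, 0.931)

print(f"{zero_shot_gain:.1f}%")   # 18.7%
print(f"{supervised_gap:.1f}%")   # -3.2%
```

In absolute terms the same comparisons are 0.135 and 0.030 F-score points, respectively.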

CONCLUSIONS

ICL offers a robust and flexible approach to detecting stigmatizing language in EHRs and a more data-efficient and equitable alternative to conventional machine learning methods. These findings suggest that ICL could enhance bias detection in clinical documentation while reducing the reliance on extensive labeled datasets.
