Efficient Detection of Stigmatizing Language in Electronic Health Records via In-Context Learning: Comparative Analysis and Validation Study.

Author information

Chen Hongbo, Alfred Myrtede, Cohen Eldan

Affiliation

Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, Canada.

Publication information

JMIR Med Inform. 2025 Aug 18;13:e68955. doi: 10.2196/68955.

DOI:10.2196/68955

PMID:40825541

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12402740/

Abstract

BACKGROUND

The presence of stigmatizing language within electronic health records (EHRs) poses significant risks to patient care by perpetuating biases. While numerous studies have explored the use of supervised machine learning models to detect stigmatizing language automatically, these models require large, annotated datasets, which may not always be readily available. In-context learning (ICL) has emerged as a data-efficient alternative, allowing large language models to adapt to tasks using only instructions and examples.
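
To make "instructions and examples" concrete, the following is a minimal sketch of how an ICL prompt might be assembled from an instruction, a handful of labeled demonstrations, and the sentence to classify. The function name `build_icl_prompt`, the label strings, and the demonstration sentences are all hypothetical and are not taken from the study's prompts or data.

```python
def build_icl_prompt(instruction, examples, query):
    """Assemble an instruction, labeled demonstrations, and the query sentence.

    With an empty `examples` list this yields a zero-shot prompt;
    with k demonstrations it yields a k-shot prompt.
    """
    lines = [instruction, ""]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Label:")  # left open for the language model to complete
    return "\n".join(lines)


# Hypothetical demonstrations, NOT drawn from the study's dataset.
examples = [
    ("Patient is a drug seeker and refuses to cooperate.", "stigmatizing"),
    ("Patient reports chest pain since this morning.", "not stigmatizing"),
]

prompt = build_icl_prompt(
    "Decide whether the sentence contains stigmatizing language.",
    examples,
    "Patient declined the medication.",
)
print(prompt)
```

No model weights are updated in this setup; the demonstrations are only part of the input text, which is why the approach needs so few annotations.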

OBJECTIVE

We aimed to investigate the efficacy of ICL in detecting stigmatizing language within EHRs under data-scarce conditions.

METHODS

We analyzed 5043 sentences from the Medical Information Mart for Intensive Care-IV dataset, which contains EHRs from patients admitted to the emergency department at the Beth Israel Deaconess Medical Center. We compared ICL with zero-shot (textual entailment), few-shot (SetFit), and supervised fine-tuning approaches. The ICL approach used 4 prompting strategies: generic, chain of thought, clue and reasoning prompting, and a newly introduced stigma detection guided prompt. Model fairness was evaluated using the equal performance criterion, measuring true positive rate, false positive rate, and F-score disparities across protected attributes, including sex, age, and race.
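
As an illustration of the fairness evaluation described above, the sketch below computes the true positive rate, false positive rate, and F-score per group and then reports the gap between groups. Two points are assumptions rather than details from the abstract: the F-score is taken to be F1, and the disparity is taken to be the largest value minus the smallest value across groups. The toy labels and group names are invented.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return TPR, FPR, and F1 for each value of a protected attribute."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    metrics = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        f1_denom = 2 * c["tp"] + c["fp"] + c["fn"]
        metrics[g] = {
            "tpr": c["tp"] / positives if positives else 0.0,
            "fpr": c["fp"] / negatives if negatives else 0.0,
            "f1": 2 * c["tp"] / f1_denom if f1_denom else 0.0,
        }
    return metrics


def disparities(metrics):
    """Assumed disparity definition: max minus min of each metric across groups."""
    return {
        name: max(m[name] for m in metrics.values())
        - min(m[name] for m in metrics.values())
        for name in ("tpr", "fpr", "f1")
    }


# Invented toy data: 1 = stigmatizing, 0 = not stigmatizing.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A"] * 4 + ["B"] * 4

per_group = group_metrics(y_true, y_pred, groups)
gaps = disparities(per_group)
print(per_group)
print(gaps)
```

Under the equal performance criterion, smaller gaps indicate that the classifier behaves more similarly across groups; a gap of zero on all three metrics would mean identical performance.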

RESULTS

In the zero-shot setting, the best-performing ICL model, GEMMA-2, achieved a mean F-score of 0.858 (95% CI 0.854-0.862), showing an 18.7% improvement over the best textual entailment model, DEBERTA-M (mean F-score 0.723, 95% CI 0.718-0.728; P<.001). In the few-shot setting, the top ICL model, LLAMA-3, outperformed the leading SetFit models by 21.2%, 21.4%, and 12.3% with 4, 8, and 16 annotations per class, respectively (P<.001). Using 32 labeled instances, the best ICL model achieved a mean F-score of 0.901 (95% CI 0.895-0.907), only 3.2% lower than the best supervised fine-tuning model, ROBERTA (mean F-score 0.931, 95% CI 0.924-0.938), which was trained on 3543 labeled instances. Under the conditions tested, fairness evaluation revealed that supervised fine-tuning models exhibited greater bias compared with ICL models in the zero-shot, 4-shot, 8-shot, and 16-shot settings, as measured by true positive rate, false positive rate, and F-score disparities.
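
The abstract does not say whether the reported percentages are absolute or relative differences. The short check below, using only the mean F-scores quoted above, suggests they are relative differences with respect to the comparison model; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Zero-shot: GEMMA-2 (0.858) vs DEBERTA-M (0.723)
zero_shot_gain = relative_change(0.858, 0.723)

# Best ICL model with 32 labels (0.901) vs ROBERTA (0.931)
gap_to_supervised = relative_change(0.901, 0.931)

print(round(zero_shot_gain, 1))     # 18.7
print(round(gap_to_supervised, 1))  # -3.2
```

Both values match the reported 18.7% and 3.2%, whereas the absolute differences would be 13.5 and 3.0 F-score points.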

CONCLUSIONS

ICL offers a robust and flexible solution for detecting stigmatizing language in EHRs and a more data-efficient and equitable alternative to conventional machine learning methods. These findings suggest that ICL could enhance bias detection in clinical documentation while reducing the reliance on extensive labeled datasets.
