Evaluation and mitigation of cognitive biases in medical language models.

Author information

Schmidgall Samuel, Harris Carl, Essien Ime, Olshvang Daniel, Rahman Tawsifur, Kim Ji Woong, Ziaei Rojin, Eshraghian Jason, Abadir Peter, Chellappa Rama

Affiliations

Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

Publication information

NPJ Digit Med. 2024 Oct 21;7(1):295. doi: 10.1038/s41746-024-01283-6.

DOI:10.1038/s41746-024-01283-6

PMID:39433945

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11494053/

Abstract

Increasing interest in applying large language models (LLMs) to medicine is due in part to their impressive performance on medical exam questions. However, these exams do not capture the complexity of real patient-doctor interactions because of factors like patient compliance, experience, and cognitive bias. We hypothesized that LLMs would produce less accurate responses when faced with clinically biased questions as compared to unbiased ones. To test this, we developed the BiasMedQA dataset, which consists of 1273 USMLE questions modified to replicate common clinically relevant cognitive biases. We assessed six LLMs on BiasMedQA and found that GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which showed large drops in performance. Additionally, we introduced three bias mitigation strategies, which improved but did not fully restore accuracy. Our findings highlight the need to improve LLMs' robustness to cognitive biases, in order to achieve more reliable applications of LLMs in healthcare.
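The comparison described in the abstract (accuracy on unmodified questions versus accuracy on the same questions with a bias-inducing statement added) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy model, the example bias sentence, and the placeholder questions are all hypothetical, and a real evaluation would call an actual LLM and use the BiasMedQA items.

```python
from typing import Callable, Sequence


def inject_bias(question: str, bias_sentence: str) -> str:
    """Append a bias-inducing sentence to a question stem (hypothetical helper)."""
    return f"{question.rstrip()} {bias_sentence.strip()}"


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(
    model: Callable[[str], str],
    questions: Sequence[str],
    answers: Sequence[str],
    bias_sentence: str,
) -> float:
    """Accuracy on the original questions minus accuracy on the biased versions."""
    clean = accuracy([model(q) for q in questions], answers)
    biased = accuracy(
        [model(inject_bias(q, bias_sentence)) for q in questions], answers
    )
    return clean - biased


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: answers "A" unless the prompt mentions a colleague."""
    return "B" if "colleague" in prompt else "A"


# Placeholder items; real inputs would be multiple-choice exam questions.
questions = ["Question 1?", "Question 2?", "Question 3?"]
answers = ["A", "A", "B"]
bias = "A colleague is confident that the answer is B."

drop = accuracy_drop(toy_model, questions, answers, bias)
print(f"accuracy drop under bias: {drop:.3f}")
```

A positive drop means the model answered less accurately once the bias sentence was present; a mitigation strategy could be assessed the same way by wrapping `model` so that it adds a warning or instruction to the prompt before answering.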

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5cc6/11494053/0370b6c24192/41746_2024_1283_Fig1_HTML.jpg
