

Disagreements in Medical Ethics Question Answering Between Large Language Models and Physicians

Authors

Soffer Shelly, Nesselroth Dafna, Pragier Keren, Anteby Roi, Apakama Donald, Holmes Emma, Sawant Ashwin Shreekant, Abbott Ethan, Lepow Lauren Alyse, Vasudev Ishita, Lampert Joshua, Gendler Moran, Horesh Nir, Efros Orly, Glicksberg Benjamin S, Freeman Robert, Reich David L, Charney Alexander W, Nadkarni Girish N, Klang Eyal

Affiliations

Rabin Medical Center.

Meuhedet Health Services.

Publication

Res Sq. 2024 Nov 15:rs.3.rs-5382879. doi: 10.21203/rs.3.rs-5382879/v1.

DOI: 10.21203/rs.3.rs-5382879/v1
PMID: 39606472
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11601831/
Abstract

IMPORTANCE

Medical ethics is inherently complex, shaped by a broad spectrum of opinions, experiences, and cultural perspectives. The integration of large language models (LLMs) into healthcare is new, and it requires an understanding of how consistently they adhere to ethical standards.

OBJECTIVE

To compare agreement rates on questions about ethically ambiguous situations among three frontier LLMs (GPT-4, Gemini-pro-1.5, and Llama-3-70b) and a multidisciplinary physician group.

METHODS

In this cross-sectional study, three LLMs generated 1,248 medical ethics questions derived from the principles outlined in the American College of Physicians Ethics Manual. The topics spanned traditional, inclusive, interdisciplinary, and contemporary themes. Each model was then tasked with answering all generated questions. Twelve practicing physicians evaluated and answered a randomly selected 10% subset of these questions. We compared agreement rates among the physicians, between the physicians and the LLMs, and among the LLMs.

RESULTS

The models generated a total of 3,744 answers. Although physicians rated the questions' complexity as moderate (2 to 3 on a 5-point scale), their agreement rate was only 55.9%. Agreement between physicians and LLMs was similarly low at 57.9%. In contrast, agreement among the LLMs was notably higher at 76.8% (p < 0.001), underscoring the consistency of LLM responses relative to both physician-physician and physician-LLM agreement.
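The agreement metric compared here can be illustrated with a short sketch. This is not the authors' code, and the answer data below is made up; it only shows how pairwise agreement rates might be computed for physician-physician, physician-LLM, and LLM-LLM comparisons, given a matrix of categorical answers per responder.

```python
# Illustrative sketch (not from the paper): mean pairwise agreement
# between two groups of responders. Rows are responders, columns are
# questions, values are categorical answers.
from itertools import chain


def pairwise_agreement(answers: dict, group_a: list, group_b: list) -> float:
    """Mean fraction of questions answered identically, over all
    distinct cross-group responder pairs (within-group pairs are
    deduplicated so (x, y) and (y, x) count once)."""
    pairs = {tuple(sorted((a, b))) for a in group_a for b in group_b if a != b}
    rates = []
    for a, b in sorted(pairs):
        matches = sum(x == y for x, y in zip(answers[a], answers[b]))
        rates.append(matches / len(answers[a]))
    return sum(rates) / len(rates)


# Hypothetical answers to four yes/no ethics questions.
answers = {
    "phys1": ["yes", "no",  "yes", "no"],
    "phys2": ["yes", "yes", "yes", "no"],
    "gpt4":  ["yes", "no",  "no",  "no"],
    "llama": ["yes", "no",  "no",  "yes"],
}

physicians = ["phys1", "phys2"]
llms = ["gpt4", "llama"]

print(pairwise_agreement(answers, physicians, physicians))  # 0.75 (physician-physician)
print(pairwise_agreement(answers, physicians, llms))        # 0.5  (physician-LLM)
print(pairwise_agreement(answers, llms, llms))              # 0.75 (LLM-LLM)
```

In the study the three rates were then compared with a significance test (the abstract reports p < 0.001 for the LLM-LLM vs. other comparisons); any standard test of proportions could play that role in a sketch like this.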

CONCLUSIONS

LLMs demonstrate higher agreement rates in ethically complex scenarios compared to physicians, suggesting their potential utility as consultants in ambiguous ethical situations. Future research should explore how LLMs can enhance consistency while adapting to the complexities of real-world ethical dilemmas.


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11e9/11601831/45c9da36e253/nihpp-rs5382879v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11e9/11601831/e730f4ac8655/nihpp-rs5382879v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11e9/11601831/01a83b3b5083/nihpp-rs5382879v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11e9/11601831/2942ad0588fc/nihpp-rs5382879v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11e9/11601831/870231185c62/nihpp-rs5382879v1-f0005.jpg

