Human-AI collectives most accurately diagnose clinical vignettes.

Authors

Zöller Nikolas, Berger Julian, Lin Irving, Fu Nathan, Komarneni Jayanth, Barabucci Gioele, Laskowski Kyle, Shia Victor, Harack Benjamin, Chu Eugene A, Trianni Vito, Kurvers Ralf H J M, Herzog Stefan M

Affiliations

Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin 14195, Germany.

The Human Diagnosis Project, San Francisco, CA 94110.

Publication information

Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2426153122. doi: 10.1073/pnas.2426153122. Epub 2025 Jun 13.

DOI:10.1073/pnas.2426153122
PMID:40512795
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12184336/
Abstract

AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased; these shortcomings may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
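
The abstract does not say how the physicians' and LLMs' differential diagnoses are combined, so the following is only a rough sketch of one plausible way to pool ranked diagnosis lists into a collective answer. The reciprocal-rank voting rule, the function name `aggregate_differentials`, the `top_k` cutoff, and the sample diagnoses are all illustrative assumptions, not the authors' method or data.

```python
from collections import defaultdict


def aggregate_differentials(differentials, top_k=5):
    """Pool ranked diagnosis lists into one collective ranking.

    Assumed rule (not taken from the paper): each list votes for its
    diagnoses with weight 1/rank, counting only its first `top_k` entries.
    """
    scores = defaultdict(float)
    for ranked in differentials:
        for rank, diagnosis in enumerate(ranked[:top_k], start=1):
            # Light normalization so trivially different spellings merge.
            scores[diagnosis.strip().lower()] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical differentials for a single vignette (made-up example data).
physicians = [
    ["pulmonary embolism", "pneumonia", "pneumothorax"],
    ["pneumonia", "pulmonary embolism"],
]
llms = [
    ["pneumonia", "bronchitis"],
    ["pulmonary embolism", "pneumonia", "heart failure"],
    ["pulmonary embolism", "pneumothorax"],
]

# A hybrid collective simply pools both kinds of contributors.
hybrid = aggregate_differentials(physicians + llms)
print(hybrid[:3])
```

Under a rule like this, a diagnosis that one group of contributors misses can still rise to the top if the other group ranks it highly, which is one way the complementary errors described in the abstract could translate into a more accurate collective ranking.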

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4953/12184336/610bca45eaff/pnas.2426153122fig01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4953/12184336/afc6a63a1015/pnas.2426153122fig02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4953/12184336/fa1afbff5d83/pnas.2426153122fig03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4953/12184336/0d93b1731fb7/pnas.2426153122fig04.jpg