Artificial Intelligence Outperforms Physicians in General Medical Knowledge, Except in the Paediatrics Domain: A Cross-Sectional Study.

Author information

Miranda Joana, Pereira-Silva Raquel, Guichard João, Meneses Jorge, Carreira Andreia Neves, Seixas Daniela

Affiliation

Tonic Easy Medical, S.A., 4300-259 Porto, Portugal.

Publication information

Bioengineering (Basel). 2025 Jun 14;12(6):653. doi: 10.3390/bioengineering12060653.

DOI:10.3390/bioengineering12060653

PMID:40564469

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12190018/

Abstract

Generative artificial intelligence (genAI) shows promising results in clinical practice. This study compared a GPT-4-turbo virtual assistant with physicians from Italy, France, Spain, and Portugal on medical knowledge derived from national exams, and analysed knowledge retention over time and domain-specific performance. Via a digital platform, 17,144 physicians provided 221,574 answers to 600 exam questions between December 2022 and February 2024. Physicians were stratified by years since graduation and specialty, and the assistant answered the same questions in each native language. Differences in proportions of correct answers were tested with binomial logistic regression (odds ratios, 95% CI) or Fisher's exact test (α = 0.05). The assistant outperformed physicians in all countries (72-96% vs. 46-62%; logistic regression, P < 0.001). Physicians also trailed the assistant across most knowledge domains (P < 0.001), except paediatrics (45% vs. 52%; Fisher's exact test, P = 0.60). Accuracy declined with seniority, falling 4-10% between the youngest and oldest cohorts (logistic regression, P < 0.001). Overall, genAI exceeds practising doctors on broad medical knowledge and may help counter knowledge attrition, though paediatrics remains a domain requiring targeted refinement.
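
The two tests named in the abstract can be illustrated on a single 2x2 table of correct and incorrect answers. The following is a minimal sketch, not the authors' analysis code: the function names and the counts in the demo are hypothetical and are not taken from the study. It computes a two-sided Fisher's exact P value and an odds ratio with a Wald 95% CI; for a single binary predictor (for example, assistant vs. physician), this interval coincides with the Wald interval from a binomial logistic regression.

```python
from math import comb, exp, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI; assumes all four cells are non-zero."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (NOT from the study):
# rows = group A, group B; columns = correct, incorrect.
table = (12, 8, 9, 11)
print("Fisher P:", fisher_exact_two_sided(*table))
print("OR, 95% CI:", odds_ratio_wald_ci(*table))
```

With a significance level of α = 0.05, as stated in the abstract, a P value above 0.05 or a confidence interval that includes 1 would mean the difference between the two groups is not statistically significant.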

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82c7/12190018/8d8f4a7a16aa/bioengineering-12-00653-g001.jpg
