Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts.

Affiliations

Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea.

Department of Linguistics, Korea University, Seoul, Republic of Korea.

Publication Information

JMIR Med Educ. 2024 Jul 8;10:e51282. doi: 10.2196/51282.

Abstract

BACKGROUND

Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse.

OBJECTIVE

This study aimed to compare the medical accuracy of GPT-4 with that of human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses.

METHODS

We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio.
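
The type-token ratio used in the linguistic analysis above is the ratio of unique words (types) to total words (tokens) in a response. The following is a minimal illustrative sketch, assuming simple lowercased whitespace tokenization (the paper's exact tokenization procedure is not specified in the abstract):

```python
def type_token_ratio(text: str) -> float:
    """Ratio of unique words (types) to total words (tokens).

    A lower ratio indicates more word repetition, i.e. a less
    diverse vocabulary. Tokenization here is naive whitespace
    splitting after lowercasing (an assumption for illustration).
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical example answer: 10 tokens, 9 unique types ("take" repeats).
answer = "take the medication twice daily and take it with food"
print(round(type_token_ratio(answer), 2))  # → 0.9
```

On short texts like these answers, longer responses tend to have lower type-token ratios simply because common words recur, which is consistent with the longer, lower-diversity GPT-4 responses reported in the Results.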

RESULTS

GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience.
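
As an illustration of the kind of categorical comparison reported above, the sketch below computes a chi-square statistic over the low/medium/high accuracy counts given in this abstract (medium counts are inferred as the remainder of each total). This is illustrative only; the authors' actual statistical procedure behind P=.001 may differ:

```python
# Accuracy-level counts from the abstract: low / medium / high.
# Medium counts inferred: 237 - 11 - 50 = 176 (human), 238 - 1 - 30 = 207 (GPT-4).
human = [11, 176, 50]   # n = 237
gpt4 = [1, 207, 30]     # n = 238

table = [human, gpt4]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)  # 475 evaluated responses

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 2))  # → 15.84 (2 degrees of freedom)
```

A statistic this large on 2 degrees of freedom corresponds to a very small p-value, in line with the significant difference in accuracy distributions reported above.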

CONCLUSIONS

GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to that of human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.

