Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts.

Affiliations

Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea.

Department of Linguistics, Korea University, Seoul, Republic of Korea.

Publication Information

JMIR Med Educ. 2024 Jul 8;10:e51282. doi: 10.2196/51282.

Abstract

BACKGROUND

Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse.

OBJECTIVE

This study aimed to compare the medical accuracy of GPT-4 with that of human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses.

METHODS

We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio.
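
The type-token ratio used in the linguistic analysis above is the ratio of unique words (types) to total words (tokens) in a response. The following is a minimal illustrative sketch, assuming simple lowercased whitespace tokenization (the paper's exact tokenization procedure is not specified in the abstract):

```python
def type_token_ratio(text: str) -> float:
    """Ratio of unique words (types) to total words (tokens).

    A lower ratio indicates more word repetition, i.e. a less
    diverse vocabulary. Tokenization here is naive whitespace
    splitting after lowercasing (an assumption for illustration).
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical example answer: 10 tokens, 9 unique types ("take" repeats).
answer = "take the medication twice daily and take it with food"
print(round(type_token_ratio(answer), 2))  # → 0.9
```

On short texts like these answers, longer responses tend to have lower type-token ratios simply because common words recur, which is consistent with the longer, lower-diversity GPT-4 responses reported in the Results.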

RESULTS

GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience.
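
As an illustration of the kind of categorical comparison reported above, the sketch below computes a chi-square statistic over the low/medium/high accuracy counts given in this abstract (medium counts are inferred as the remainder of each total). This is illustrative only; the authors' actual statistical procedure behind P=.001 may differ:

```python
# Accuracy-level counts from the abstract: low / medium / high.
# Medium counts inferred: 237 - 11 - 50 = 176 (human), 238 - 1 - 30 = 207 (GPT-4).
human = [11, 176, 50]   # n = 237
gpt4 = [1, 207, 30]     # n = 238

table = [human, gpt4]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)  # 475 evaluated responses

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 2))  # → 15.84 (2 degrees of freedom)
```

A statistic this large on 2 degrees of freedom corresponds to a very small p-value, in line with the significant difference in accuracy distributions reported above.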

CONCLUSIONS

GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to that of human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.

