

ChatGPT Performance Deteriorated in Patients with Comorbidities When Providing Cardiological Therapeutic Consultations.

Author Information

Hao Wen-Rui, Chen Chun-Chao, Chen Kuan, Li Long-Chen, Chiu Chun-Chih, Yang Tsung-Yeh, Jong Hung-Chang, Yang Hsuan-Chia, Huang Chih-Wei, Liu Ju-Chi, Li Yu-Chuan Jack

Affiliations

Taipei Heart Institute, Taipei Medical University, Taipei 11002, Taiwan.

Division of Cardiology, Department of Internal Medicine, School of Medicine, College of Medicine, Taipei Medical University, Taipei 11002, Taiwan.

Publication Information

Healthcare (Basel). 2025 Jul 3;13(13):1598. doi: 10.3390/healthcare13131598.

Abstract

Background: Large language models (LLMs) like ChatGPT are increasingly being explored for medical applications. However, their reliability in providing medication advice for patients with complex clinical situations, particularly those with multiple comorbidities, remains uncertain and under-investigated. This study aimed to systematically evaluate the performance, consistency, and safety of ChatGPT in generating medication recommendations for complex cardiovascular disease (CVD) scenarios.

Methods: In this simulation-based study (21 January-1 February 2024), ChatGPT 3.5 and 4.0 were prompted 10 times for each of 25 scenarios, representing five common CVDs paired with five major comorbidities. A panel of five cardiologists independently classified each unique drug recommendation as "high priority" or "low priority". Key metrics included physician approval rates, the proportion of high-priority recommendations, response consistency (Jaccard similarity index), and error pattern analysis. Statistical comparisons were made using Z-tests, chi-square tests, and Wilcoxon signed-rank tests.

Results: The overall physician approval rate for GPT-4 (86.90%) was modestly but significantly higher than that for GPT-3.5 (85.06%; p = 0.0476) based on aggregated data. However, a more rigorous paired-scenario analysis of high-priority recommendations revealed no statistically significant difference between the models (p = 0.407), indicating the advantage is not systematic. A chi-square test confirmed significant differences in error patterns (p < 0.001); notably, GPT-4 more frequently recommended contraindicated drugs in high-risk scenarios. Inter-model consistency was low (mean Jaccard index = 0.42), showing the models often provide different advice.

Conclusions: While demonstrating high overall physician approval rates, current LLMs exhibit inconsistent performance and pose significant safety risks when providing medication advice for complex CVD cases. Their reliability does not yet meet the standards for autonomous clinical application. Future work must focus on leveraging real-world data for validation and developing domain-specific, fine-tuned models to enhance safety and accuracy. Until then, vigilant professional oversight is indispensable.
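The abstract reports response consistency via a Jaccard similarity index and compares the models with a paired-scenario Wilcoxon signed-rank test. The study does not publish its analysis code, so the sketch below only illustrates how such metrics are typically computed; the function names, drug sets, and per-scenario proportions are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): Jaccard-based consistency between
# recommendation sets and a paired Wilcoxon signed-rank test across scenarios.
from itertools import combinations
from scipy.stats import wilcoxon


def mean_pairwise_jaccard(responses):
    """Mean Jaccard similarity over all pairs of recommendation sets.

    `responses` is a list of sets of drug names, e.g. the 10 repeated answers
    for one scenario, or the GPT-3.5 vs. GPT-4 answers for the same scenario.
    """
    sims = [len(a & b) / len(a | b)
            for a, b in combinations(responses, 2) if a | b]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical repeated recommendations for a single scenario (drug names invented).
runs = [
    {"aspirin", "atorvastatin", "metoprolol"},
    {"aspirin", "atorvastatin"},
    {"aspirin", "atorvastatin", "ramipril"},
]
print(f"Per-scenario consistency: {mean_pairwise_jaccard(runs):.2f}")

# Paired comparison: proportion of high-priority recommendations per scenario
# for each model (values invented; the study used 25 scenario pairs).
gpt35_high_priority = [0.70, 0.65, 0.80, 0.75, 0.60]
gpt4_high_priority = [0.72, 0.68, 0.78, 0.74, 0.66]
stat, p_value = wilcoxon(gpt35_high_priority, gpt4_high_priority)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.3f}")
```

Pairing the two models on the same scenarios, rather than pooling all recommendations, is what allows the abstract's conclusion that GPT-4's aggregate advantage is not systematic.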


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9df0/12249446/f912032f9bde/healthcare-13-01598-g001.jpg
