

Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.

Author Information

Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.

Department of Medicine, University of California, San Francisco School of Medicine, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA.

Publication Information

Abdom Radiol (NY). 2024 Dec;49(12):4286-4294. doi: 10.1007/s00261-024-04501-7. Epub 2024 Aug 1.

Abstract

PURPOSE

To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

METHODS

Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A response was considered accurate if its mean score was > 0, and a question's responses were considered reliable if the mean score was > 0 across all three replicate responses to that question. Readability was quantified using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
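
As a concrete illustration of the scoring and statistical pipeline described above, the Python sketch below shows how such an analysis could be assembled. It is not the authors' code: the reviewer scores and readability values are fabricated placeholders, and the third-party textstat, scipy, and statsmodels packages are assumed for the Flesch metrics, ANOVA, and Tukey tests.

```python
# Minimal sketch of the scoring and statistics described above -- an
# illustration only, not the authors' analysis code. All data below are
# fabricated placeholders.
import numpy as np
import textstat
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical reviewer scores for one question's three replicate responses:
# six physicians each assign 1 (accurate), 0 (inadequate), or -1 (inaccurate).
scores = np.array([
    [1, 1, 0, 1, 1, 1],   # replicate response 1
    [1, 0, 1, 1, 1, -1],  # replicate response 2
    [1, 1, 1, 0, 1, 1],   # replicate response 3
])
response_means = scores.mean(axis=1)
accurate = response_means.mean() > 0    # accurate: overall mean score > 0
reliable = (response_means > 0).all()   # reliable: every replicate's mean > 0
print(f"mean={response_means.mean():.2f} accurate={accurate} reliable={reliable}")

# Readability of a single (placeholder) chatbot response.
text = "Hepatocellular carcinoma surveillance typically uses ultrasound."
print(textstat.flesch_reading_ease(text), textstat.flesch_kincaid_grade(text))

# Compare readability across chatbots: 60 responses each (20 questions asked
# in triplicate). Random values stand in for the measured scores.
rng = np.random.default_rng(0)
fres = {
    "ChatGPT": rng.normal(29, 5, 60),
    "Gemini": rng.normal(30, 5, 60),
    "Bing": rng.normal(40, 5, 60),
}
f_stat, p = f_oneway(*fres.values())
print(f"one-way ANOVA: F={f_stat:.2f}, p={p:.2g}")

values = np.concatenate(list(fres.values()))
groups = np.repeat(list(fres.keys()), 60)
print(pairwise_tukeyhsd(values, groups))  # Tukey's multiple comparison test
```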

RESULTS

Of the twenty questions, ChatGPT answered 9 (45%), Gemini 12 (60%), and Bing 6 (30%) accurately; however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were answered both accurately and reliably. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college level) and Bing (40; college level; p < 0.001).
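
For context, the reading levels reported here follow the standard Flesch formulas, which the abstract does not restate; in LaTeX notation:

```latex
% Flesch Reading Ease Score (higher = easier to read)
\mathrm{FRES} = 206.835
  - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right)
  - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)

% Flesch-Kincaid Grade Level (approximate U.S. school grade)
\mathrm{FKGL} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right)
  + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right)
  - 15.59
```

On the conventional FRES interpretation scale, scores of 0-30 correspond to college-graduate material and 30-50 to college-level material, which is how the reported means of 29 (ChatGPT), 30 (Gemini), and 40 (Bing) map to the stated reading levels.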

CONCLUSION

Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.

