
Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.

Affiliations

Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.

Department of Medicine, University of California San Francisco School of Medicine, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA.

Publication Information

Abdom Radiol (NY). 2024 Dec;49(12):4286-4294. doi: 10.1007/s00261-024-04501-7. Epub 2024 Aug 1.

DOI:10.1007/s00261-024-04501-7
PMID:39088019
Abstract

PURPOSE

To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

METHODS

Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and a question's answer was considered reliable if the mean score was > 0 across all of its responses. Readability was quantified using the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
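For reference, the two readability metrics used here follow the standard published Flesch formulas (standard definitions; the article does not restate them). With W = total words, S = total sentences, and Y = total syllables:

Flesch Reading Ease (FRE) = 206.835 − 1.015 × (W / S) − 84.6 × (Y / W)
Flesch-Kincaid Grade Level (FKGL) = 0.39 × (W / S) + 11.8 × (Y / W) − 15.59

Lower FRE scores indicate harder text; scores below 30 correspond to the college-graduate reading level reported in the results below.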

RESULTS

Of the twenty questions, ChatGPT answered 9 (45%) accurately, Gemini 12 (60%), and Bing 6 (30%); however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college level) and Bing (40; college level; p < 0.001).

CONCLUSION

Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.


Similar Articles

1
Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.
Abdom Radiol (NY). 2024 Dec;49(12):4286-4294. doi: 10.1007/s00261-024-04501-7. Epub 2024 Aug 1.
2
Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.
Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.
3
Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.
J Pediatr Ophthalmol Strabismus. 2025 Mar-Apr;62(2):84-95. doi: 10.3928/01913913-20240911-05. Epub 2024 Oct 28.
4
Microsoft Copilot Provides More Accurate and Reliable Information About Anterior Cruciate Ligament Injury and Repair Than ChatGPT and Google Gemini; However, No Resource Was Overall the Best.
Arthrosc Sports Med Rehabil. 2024 Nov 19;7(2):101043. doi: 10.1016/j.asmr.2024.101043. eCollection 2025 Apr.
5
Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.
JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.
6
Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients?
Clin Orthop Relat Res. 2025 Feb 1;483(2):306-315. doi: 10.1097/CORR.0000000000003263. Epub 2024 Sep 25.
7
Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.
Br J Dermatol. 2025 Jan 24;192(2):306-315. doi: 10.1093/bjd/ljae377.
8
Performance of Artificial Intelligence Chatbots in Responding to Patient Queries Related to Traumatic Dental Injuries: A Comparative Study.
Dent Traumatol. 2025 Jun;41(3):338-347. doi: 10.1111/edt.13020. Epub 2024 Nov 22.
9
Evaluation of the reliability and readability of answers given by chatbots to frequently asked questions about endophthalmitis: A cross-sectional study on chatbots.
Health Informatics J. 2024 Oct-Dec;30(4):14604582241304679. doi: 10.1177/14604582241304679.
10
Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment.
Int J Med Inform. 2025 Sep;201:105948. doi: 10.1016/j.ijmedinf.2025.105948. Epub 2025 Apr 25.

Cited By

1
Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study.
JMIR Cancer. 2025 Sep 11;11:e70176. doi: 10.2196/70176.
2
Comparison of the readability of ChatGPT and Bard in medical communication: a meta-analysis.
BMC Med Inform Decis Mak. 2025 Sep 1;25(1):325. doi: 10.1186/s12911-025-03035-2.
3
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
4
Revolutionizing MASLD: How Artificial Intelligence Is Shaping the Future of Liver Care.
Cancers (Basel). 2025 Feb 20;17(5):722. doi: 10.3390/cancers17050722.
5
Application of large language models in disease diagnosis and treatment.
Chin Med J (Engl). 2025 Jan 20;138(2):130-142. doi: 10.1097/CM9.0000000000003456. Epub 2024 Dec 26.

References

1
Comparative study of ChatGPT and human evaluators on the assessment of medical literature according to recognised reporting standards.
BMJ Health Care Inform. 2023 Oct;30(1). doi: 10.1136/bmjhci-2023-100830.
2
Enhancing Triage Efficiency and Accuracy in Emergency Rooms for Patients with Metastatic Prostate Cancer: A Retrospective Analysis of Artificial Intelligence-Assisted Triage Using ChatGPT 4.0.
Cancers (Basel). 2023 Jul 22;15(14):3717. doi: 10.3390/cancers15143717.
3
Use of ChatGPT, GPT-4, and Bard to Improve Readability of ChatGPT's Answers to Common Questions About Lung Cancer and Lung Cancer Screening.
AJR Am J Roentgenol. 2023 Nov;221(5):701-704. doi: 10.2214/AJR.23.29622. Epub 2023 Jun 21.
4
Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports.
Clin Imaging. 2023 Sep;101:137-141. doi: 10.1016/j.clinimag.2023.06.008. Epub 2023 Jun 8.
5
How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard.
Radiology. 2023 Jun;307(5):e230922. doi: 10.1148/radiol.230922.
6
Accuracy of Information Provided by ChatGPT Regarding Liver Cancer Surveillance and Diagnosis.
AJR Am J Roentgenol. 2023 Oct;221(4):556-559. doi: 10.2214/AJR.23.29493. Epub 2023 May 24.
7
Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations.
Radiology. 2023 Jun;307(5):e230582. doi: 10.1148/radiol.230582. Epub 2023 May 16.
8
Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT.
Radiology. 2023 May;307(4):e230424. doi: 10.1148/radiol.230424. Epub 2023 Apr 4.
9
Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma.
Clin Mol Hepatol. 2023 Jul;29(3):721-732. doi: 10.3350/cmh.2023.0089. Epub 2023 Mar 22.
10
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.