
Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.

Affiliations

Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.

Department of Medicine, University of California San Francisco School of Medicine, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA.

Publication Information

Abdom Radiol (NY). 2024 Dec;49(12):4286-4294. doi: 10.1007/s00261-024-04501-7. Epub 2024 Aug 1.

DOI:10.1007/s00261-024-04501-7
PMID:39088019
Abstract

PURPOSE

To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

METHODS

Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and a question's answer was considered reliable if the mean score was > 0 across all of its responses. Readability was quantified using the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
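For reference, the two readability metrics used here follow the standard published Flesch formulas (standard definitions; the article does not restate them). With W = total words, S = total sentences, and Y = total syllables:

Flesch Reading Ease (FRE) = 206.835 − 1.015 × (W / S) − 84.6 × (Y / W)
Flesch-Kincaid Grade Level (FKGL) = 0.39 × (W / S) + 11.8 × (Y / W) − 15.59

Lower FRE scores indicate harder text; scores below 30 correspond to the college-graduate reading level reported in the results below.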

RESULTS

Of the twenty questions, ChatGPT answered 9 (45%) accurately, Gemini 12 (60%), and Bing 6 (30%); however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college level) and Bing (40; college level; p < 0.001).

CONCLUSION

Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.


Similar Articles

1
Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.
Abdom Radiol (NY). 2024 Dec;49(12):4286-4294. doi: 10.1007/s00261-024-04501-7. Epub 2024 Aug 1.
2
Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.
Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.
3
Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.
J Pediatr Ophthalmol Strabismus. 2025 Mar-Apr;62(2):84-95. doi: 10.3928/01913913-20240911-05. Epub 2024 Oct 28.
4
Microsoft Copilot Provides More Accurate and Reliable Information About Anterior Cruciate Ligament Injury and Repair Than ChatGPT and Google Gemini; However, No Resource Was Overall the Best.
Arthrosc Sports Med Rehabil. 2024 Nov 19;7(2):101043. doi: 10.1016/j.asmr.2024.101043. eCollection 2025 Apr.
5
Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.
JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.
6
Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients?
Clin Orthop Relat Res. 2025 Feb 1;483(2):306-315. doi: 10.1097/CORR.0000000000003263. Epub 2024 Sep 25.
7
Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.
Br J Dermatol. 2025 Jan 24;192(2):306-315. doi: 10.1093/bjd/ljae377.
8
Performance of Artificial Intelligence Chatbots in Responding to Patient Queries Related to Traumatic Dental Injuries: A Comparative Study.
Dent Traumatol. 2025 Jun;41(3):338-347. doi: 10.1111/edt.13020. Epub 2024 Nov 22.
9
Evaluation of the reliability and readability of answers given by chatbots to frequently asked questions about endophthalmitis: A cross-sectional study on chatbots.
Health Informatics J. 2024 Oct-Dec;30(4):14604582241304679. doi: 10.1177/14604582241304679.
10
Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment.
Int J Med Inform. 2025 Sep;201:105948. doi: 10.1016/j.ijmedinf.2025.105948. Epub 2025 Apr 25.

Cited By

1
Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study.
JMIR Cancer. 2025 Sep 11;11:e70176. doi: 10.2196/70176.
2
Comparison of the readability of ChatGPT and Bard in medical communication: a meta-analysis.
BMC Med Inform Decis Mak. 2025 Sep 1;25(1):325. doi: 10.1186/s12911-025-03035-2.
3
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
4
Revolutionizing MASLD: How Artificial Intelligence Is Shaping the Future of Liver Care.
Cancers (Basel). 2025 Feb 20;17(5):722. doi: 10.3390/cancers17050722.
5
Application of large language models in disease diagnosis and treatment.
Chin Med J (Engl). 2025 Jan 20;138(2):130-142. doi: 10.1097/CM9.0000000000003456. Epub 2024 Dec 26.

References

1
Comparative study of ChatGPT and human evaluators on the assessment of medical literature according to recognised reporting standards.
BMJ Health Care Inform. 2023 Oct;30(1). doi: 10.1136/bmjhci-2023-100830.
2
Enhancing Triage Efficiency and Accuracy in Emergency Rooms for Patients with Metastatic Prostate Cancer: A Retrospective Analysis of Artificial Intelligence-Assisted Triage Using ChatGPT 4.0.
Cancers (Basel). 2023 Jul 22;15(14):3717. doi: 10.3390/cancers15143717.
3
Use of ChatGPT, GPT-4, and Bard to Improve Readability of ChatGPT's Answers to Common Questions About Lung Cancer and Lung Cancer Screening.
AJR Am J Roentgenol. 2023 Nov;221(5):701-704. doi: 10.2214/AJR.23.29622. Epub 2023 Jun 21.
4
Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports.
Clin Imaging. 2023 Sep;101:137-141. doi: 10.1016/j.clinimag.2023.06.008. Epub 2023 Jun 8.
5
How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard.
Radiology. 2023 Jun;307(5):e230922. doi: 10.1148/radiol.230922.
6
Accuracy of Information Provided by ChatGPT Regarding Liver Cancer Surveillance and Diagnosis.
AJR Am J Roentgenol. 2023 Oct;221(4):556-559. doi: 10.2214/AJR.23.29493. Epub 2023 May 24.
7
Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations.
Radiology. 2023 Jun;307(5):e230582. doi: 10.1148/radiol.230582. Epub 2023 May 16.
8
Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT.
Radiology. 2023 May;307(4):e230424. doi: 10.1148/radiol.230424. Epub 2023 Apr 4.
9
Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma.
Clin Mol Hepatol. 2023 Jul;29(3):721-732. doi: 10.3350/cmh.2023.0089. Epub 2023 Mar 22.
10
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.