
Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.

Author information

Kamminga Nadia C W, Kievits June E C, Plaisier Peter W, Burgers Jako S, van der Veldt Astrid M, van den Brand Jan A G J, Mulder Mark, Wakkee Marlies, Lugtenberg Marjolein, Nijsten Tamar

Affiliations

Department of Dermatology, Erasmus MC Cancer Institute, University Medical Center Rotterdam, the Netherlands.

Department of Surgery, Albert Schweitzer Hospital, Dordrecht, the Netherlands.

Publication information

Br J Dermatol. 2025 Jan 24;192(2):306-315. doi: 10.1093/bjd/ljae377.

Abstract

BACKGROUND

Large language models (LLMs) have a potential role in providing adequate patient information.

OBJECTIVES

To compare the quality of LLM responses with established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.

METHODS

Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for the LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, additionally, reproducibility for the LLMs. Comparative analyses were performed within LLMs and within PIRs using Friedman's ANOVA, and between the best-performing LLMs and the gold-standard (GS) PIR using the Wilcoxon signed-rank test.
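The statistical design described above can be sketched in code. The snippet below is a hypothetical illustration only: the per-question quality ratings are invented placeholder data (the study's actual scores are not reproduced here), and the 1-5 rating scale is an assumption. It shows Friedman's ANOVA for the within-group comparison of repeated ratings and the Wilcoxon signed-rank test for the paired comparison against a gold-standard resource.

```python
# Hypothetical sketch of the study's statistical comparisons.
# All rating data below are invented placeholders, not study results.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_questions = 50  # the study posed 50 melanoma-specific questions

# Invented per-question ratings (assumed 1-5 scale) for the three LLMs
chatgpt35 = rng.integers(3, 6, n_questions)
chatgpt40 = rng.integers(2, 6, n_questions)
gemini = rng.integers(2, 6, n_questions)

# Within-group comparison: Friedman's ANOVA (non-parametric,
# repeated measures across the same 50 questions)
stat_f, p_f = friedmanchisquare(chatgpt35, chatgpt40, gemini)
print(f"Friedman chi-square = {stat_f:.2f}, P = {p_f:.3f}")

# Between the best-performing LLM and the gold-standard PIR:
# Wilcoxon signed-rank test on the paired per-question ratings
gs_pir = rng.integers(3, 6, n_questions)
stat_w, p_w = wilcoxon(chatgpt35, gs_pir)
print(f"Wilcoxon statistic = {stat_w:.2f}, P = {p_w:.3f}")
```

Both tests are paired by question, which matches the design: every source answered the same 50 questions, so per-question ratings can be compared directly rather than as independent samples.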

RESULTS

Within the LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009), while Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). The PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLM outperformed the GS-PIR on completeness and personalization but was less accurate and less readable. Over time, response reproducibility decreased for all LLMs, with variability across outcomes.

CONCLUSIONS

Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.

