Kamminga Nadia C W, Kievits June E C, Plaisier Peter W, Burgers Jako S, van der Veldt Astrid M, van den Brand Jan A G J, Mulder Mark, Wakkee Marlies, Lugtenberg Marjolein, Nijsten Tamar
Department of Dermatology, Erasmus MC Cancer Institute, University Medical Center Rotterdam, the Netherlands.
Department of Surgery, Albert Schweitzer Hospital, Dordrecht, the Netherlands.
Br J Dermatol. 2025 Jan 24;192(2):306-315. doi: 10.1093/bjd/ljae377.
Large language models (LLMs) have a potential role in providing adequate patient information.
To compare the quality of LLM responses with established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.
Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for the LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization and readability, plus reproducibility for the LLMs. Comparative analyses were performed within LLMs and within PIRs using Friedman's ANOVA, and between the best-performing LLMs and the gold-standard (GS) PIR using the Wilcoxon signed-rank test.
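As a rough illustration of the within-group comparison, the sketch below computes a tie-corrected Friedman statistic for three related sets of scores (for example, three LLMs rated on the same questions). It is not the authors' analysis code. The function names and the toy scores are hypothetical, and the P value uses the chi-square approximation with 2 degrees of freedom, which applies only when exactly three groups are compared.

```python
import math


def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_three_groups(rows):
    """Tie-corrected Friedman chi-square for rows of exactly 3 related scores.

    Each row holds the scores that the three raters/models received on one
    question. Returns (statistic, approximate p-value).
    """
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        if len(row) != k:
            raise ValueError("each row must contain exactly 3 scores")
        ranks = average_ranks(row)
        for j in range(k):
            rank_sums[j] += ranks[j]
        # Accumulate t^3 - t for every group of t tied scores in this row.
        for value in set(row):
            t = row.count(value)
            tie_term += t * (t * t - 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every row")
    sum_sq = sum(r * r for r in rank_sums)
    stat = (12 / (n * k * (k + 1)) * sum_sq - 3 * n * (k + 1)) / correction
    # Chi-square survival function for df = k - 1 = 2 is exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Hypothetical accuracy scores (one row per question, one column per model).
toy_scores = [
    [4, 3, 5],
    [5, 3, 4],
    [4, 2, 5],
    [3, 3, 4],
    [5, 4, 4],
]
statistic, p = friedman_three_groups(toy_scores)
print(f"Friedman chi-square = {statistic:.3f}, approximate P = {p:.3f}")
```

A paired follow-up between two selected sources, such as the Wilcoxon signed-rank test named in the methods, would be run separately on the two corresponding columns.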
Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet they were less accurate and less readable. Over time, response reproducibility decreased for all LLMs, showing variability across outcomes.
Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.