Department of Urology, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Urological Sciences, University of British Columbia, Stone Centre at Vancouver General Hospital, Vancouver, British Columbia, Canada.
Department of Urological Sciences, University of British Columbia, Stone Centre at Vancouver General Hospital, Vancouver, British Columbia, Canada.
Urology. 2024 Apr;186:107-113. doi: 10.1016/j.urology.2023.11.042. Epub 2024 Feb 21.
To compare the readability and accuracy of large language model-generated patient information materials (PIMs) with those supplied by the American Urological Association (AUA), Canadian Urological Association (CUA), and European Association of Urology (EAU) for kidney stones.
PIMs related to nephrolithiasis were obtained from the AUA, CUA, and EAU and categorized. The most frequent patient questions about kidney stones were identified from an internet query and input into GPT-3.5 and GPT-4. PIMs and ChatGPT outputs were assessed for accuracy and readability using previously published indexes. We also assessed how ChatGPT outputs changed when a target reading level (grade 6) was specified.
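The abstract does not name the specific readability indexes the study used; Flesch-Kincaid, SMOG, and Gunning Fog are common choices in health-literacy research. A minimal sketch of how such grade-level scoring could be reproduced with the open-source textstat package follows; the sample text and the grade-6 prompt are illustrative assumptions, not material from the study.

```python
# Minimal sketch of grade-level readability scoring for patient
# information text. Assumption: the study's exact indexes are not
# named in the abstract; Flesch-Kincaid, SMOG, and Gunning Fog
# (via the `textstat` package) stand in as typical examples.
import textstat

# Illustrative sample text, not taken from any of the study's PIMs.
pim_text = (
    "Kidney stones form when minerals in the urine crystallize. "
    "Drinking more water can lower the risk of forming new stones."
)

# Hypothetical prompt of the kind the study describes, asking the
# model to target a grade 6 reading level.
grade6_prompt = "Explain how to prevent kidney stones at a grade 6 reading level."

print(textstat.flesch_kincaid_grade(pim_text))  # approximate U.S. school grade
print(textstat.smog_index(pim_text))            # SMOG grade estimate
print(textstat.gunning_fog(pim_text))           # Gunning Fog index
```

Lower scores indicate easier text; common health-literacy guidance recommends patient materials at roughly a grade 6 to 8 reading level, which is the benchmark the grade-6 prompt targets.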
Readability scores were better for PIMs from the CUA (grade level 10-12), AUA (8-10), and EAU (9-11) than for the chatbot outputs. GPT-3.5 had the worst readability scores, at grade 13-14, and GPT-4 was likewise less readable than the urologic organization PIMs, with scores of 11-13. Organizational PIMs were deemed accurate; chatbot outputs were also highly accurate, though minor details were omitted. GPT-4 was more accurate than GPT-3.5 on general stone information and on the dietary and medical management of kidney stones, while both models were equally accurate on the surgical management of nephrolithiasis.
Current PIMs from major urologic organizations for kidney stones remain more readable than publicly available GPT outputs, but their reading levels are still above the reading ability of the general population. Of the available PIMs for kidney stones, those from the AUA are the most readable. Although chatbot outputs for common kidney stone patient queries are highly accurate, with only minor details omitted, it is important for clinicians to understand their strengths and limitations.