School of Medicine, University College Cork, Cork, Ireland.
Department of Urology, Mercy University Hospital, Cork, Ireland.
World J Urol. 2024 Jul 29;42(1):455. doi: 10.1007/s00345-024-05146-3.
Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize and generate content. The potential benefits of LLMs in healthcare are predicted to be immense. The objective of this study was to examine the quality of patient information leaflets (PILs) produced by three LLMs on urological topics.
Prompts were created to generate PILs from three LLMs: ChatGPT-4, PaLM 2 (Google Bard) and Llama 2 (Meta), across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate [TURP]). PILs were evaluated using a quality assessment checklist, and PIL readability was assessed with the Average Reading Level Consensus Calculator.
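To illustrate the workflow described above, the following is a minimal Python sketch. It assumes a placeholder query_llm function standing in for whichever model API (ChatGPT-4, PaLM 2 or Llama 2) is queried, an illustrative prompt wording that is not the study's exact prompt, and the textstat package used as a stand-in for the Average Reading Level Consensus Calculator.

import textstat  # readability consensus formulas; stand-in for the study's calculator

TOPICS = [
    "circumcision",
    "nephrectomy",
    "overactive bladder syndrome",
    "transurethral resection of the prostate (TURP)",
]

def build_prompt(topic: str) -> str:
    # Illustrative prompt wording; not the exact prompt used in the study.
    return (
        f"Write a patient information leaflet about {topic}, covering what it is, "
        "why it is done, risks, benefits, alternatives and recovery, "
        "in plain language suitable for patients."
    )

def query_llm(prompt: str) -> str:
    # Placeholder for a call to the chosen model's API (ChatGPT-4, PaLM 2 or Llama 2).
    return "Placeholder leaflet text returned by the model."

for topic in TOPICS:
    leaflet = query_llm(build_prompt(topic))
    # text_standard() returns a consensus reading grade derived from several
    # readability formulas, broadly analogous to an average reading-level estimate.
    print(topic, textstat.text_standard(leaflet, float_output=True))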
PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs were of the highest quality for all topics except TURP, and PaLM 2 was the only LLM to include images. Medical inaccuracies were present in all generated content, including instances of significant error. Readability analysis identified PaLM 2-generated PILs as the simplest (average reading level of age 14-15), while Llama 2 PILs were the most difficult (age 16-17 on average).
While LLMs can generate PILs that may help reduce healthcare professional workload, generated content requires clinician input to ensure accuracy and the inclusion of health literacy aids, such as images. LLM-generated PILs were above the average reading level for adults, necessitating improvements to LLM algorithms and/or prompt design. Patient satisfaction with LLM-generated PILs remains to be evaluated.