

Accuracy, readability, and understandability of large language models for prostate cancer information to the public.

Author Information

Hershenhouse Jacob S, Mokhtar Daniel, Eppler Michael B, Rodler Severin, Storino Ramacciotti Lorenzo, Ganjavi Conner, Hom Brian, Davis Ryan J, Tran John, Russo Giorgio Ivan, Cocci Andrea, Abreu Andre, Gill Inderbir, Desai Mihir, Cacciamani Giovanni E

Affiliations

USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA.

Publication Information

Prostate Cancer Prostatic Dis. 2024 May 14. doi: 10.1038/s41391-024-00826-y.

Abstract

BACKGROUND

Generative Pretrained Transformer (GPT) chatbots have gained popularity since the public release of ChatGPT. Studies have evaluated the ability of different GPT models to provide information about medical conditions. To date, no study has assessed the quality of ChatGPT outputs to prostate cancer-related questions from both the physician and public perspectives while optimizing outputs for patient consumption.

METHODS

Nine prostate cancer-related questions, identified through Google Trends (Global), were categorized into diagnosis, treatment, and postoperative follow-up. These questions were processed using ChatGPT 3.5, and the responses were recorded. Subsequently, these responses were fed back into ChatGPT to create simplified summaries understandable at a sixth-grade reading level. Readability of both the original ChatGPT responses and the layperson summaries was evaluated using validated readability tools. A survey was conducted among urology providers (urologists and urologists in training) to rate the original ChatGPT responses for accuracy, completeness, and clarity on a 5-point Likert scale. Furthermore, two independent reviewers evaluated the layperson summaries on a correctness trifecta: accuracy, completeness, and decision-making sufficiency. Public assessment of the simplified summaries' clarity and understandability was carried out through Amazon Mechanical Turk (MTurk); participants rated the clarity of each summary and demonstrated their understanding through a multiple-choice question.
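A minimal sketch of this two-step pipeline, assuming the openai (v1+) Python client and the textstat package; the model name, prompt wording, and sample question are illustrative assumptions, not the authors' exact protocol:

```python
# Sketch of the two-step pipeline: ask ChatGPT a patient question,
# re-prompt it for a sixth-grade summary, then score both texts with
# the six readability indices reported in the Results.
# Assumptions (not from the paper): openai>=1.0 client, textstat,
# and illustrative prompt wording / model name.
from openai import OpenAI
import textstat

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """Send a single-turn chat prompt and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


question = "What are the treatment options for localized prostate cancer?"
original = ask(question)

# Step 2: feed the original response back for a layperson summary.
simplified = ask(
    "Rewrite the following text so it is understandable at a "
    f"sixth-grade reading level:\n\n{original}"
)

# Score both versions with validated readability indices.
for label, text in [("original", original), ("simplified", simplified)]:
    print(
        label,
        textstat.flesch_reading_ease(text),
        textstat.gunning_fog(text),
        textstat.flesch_kincaid_grade(text),
        textstat.coleman_liau_index(text),
        textstat.smog_index(text),
        textstat.automated_readability_index(text),
    )
```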

RESULTS

GPT-generated output was deemed correct by 71.7% to 94.3% of raters (36 urologists, 17 urology residents) across 9 scenarios. GPT-generated simplified layperson summaries of this output were rated as accurate in 8 of 9 (88.9%) scenarios and as sufficient for a patient to make a decision in 8 of 9 (88.9%) scenarios. Mean readability of the layperson summaries was better than that of the original GPT outputs ([original ChatGPT v. simplified ChatGPT, mean (SD), p-value] Flesch Reading Ease: 36.5 (9.1) v. 70.2 (11.2), p < 0.0001; Gunning Fog: 15.8 (1.7) v. 9.5 (2.0), p < 0.0001; Flesch-Kincaid Grade Level: 12.8 (1.2) v. 7.4 (1.7), p < 0.0001; Coleman-Liau: 13.7 (2.1) v. 8.6 (2.4), p = 0.0002; SMOG Index: 11.8 (1.2) v. 6.7 (1.8), p < 0.0001; Automated Readability Index: 13.1 (1.4) v. 7.5 (2.1), p < 0.0001). MTurk workers (n = 514) rated the layperson summaries as correct (89.5-95.7%) and correctly understood the content (63.0-87.4%).
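For orientation, the scores run in opposite directions: Flesch Reading Ease rises as text gets easier, while the grade-level indices (Flesch-Kincaid, Gunning Fog, Coleman-Liau, SMOG, ARI) fall. The two Flesch formulas below are the standard published definitions, reproduced here for reference rather than taken from the paper:

```latex
% Higher FRE = easier text; FKGL reports an approximate US school grade
\mathrm{FRE}  = 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}} - 84.6\,\frac{\text{syllables}}{\text{words}}
\mathrm{FKGL} = 0.39\,\frac{\text{words}}{\text{sentences}} + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59
```

On these scales, the reported jump from FRE of about 36.5 (difficult, college-level) to about 70.2 (fairly easy) is consistent with the drop in grade level from roughly 12.8 to 7.4.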

CONCLUSION

GPT shows promise for delivering correct patient education on prostate cancer-related content, but the technology was not designed for delivering information to patients. Prompting the model for accuracy, completeness, clarity, and readability may enhance its utility in GPT-powered medical chatbots.

