Yudovich Max S, Alzubaidi Ahmad N, Raman Jay D
Penn State Health Milton S. Hershey Medical Center, Hershey, PA, USA.
Clin Med Insights Oncol. 2024 Nov 17;18:11795549241296781. doi: 10.1177/11795549241296781. eCollection 2024.
Chat Generative Pre-Trained Transformer (ChatGPT) has previously been shown to accurately predict colon cancer screening intervals when provided with clinical data and context in the form of guidelines. The National Comprehensive Cancer Network (NCCN) guideline on non-muscle invasive bladder cancer (NMIBC) includes criteria for risk stratification into low-, intermediate-, and high-risk groups based on patient and disease characteristics. The aim of this study is to evaluate the ability of ChatGPT to apply the NCCN Guidelines to risk stratify theoretical patient scenarios related to NMIBC.
Thirty-six hypothetical patient scenarios related to NMIBC were created and submitted to GPT-3.5 and GPT-4 at two separate time points. First, both models were prompted to risk stratify the patients with no additional context. The written versions of the NMIBC NCCN Guidelines were then supplied as textual context through custom instructions, and risk stratification was repeated. Finally, GPT-4 was given an image of the NMIBC risk groups table, and risk stratification was performed again.
GPT-3.5 correctly risk stratified 68% (24.5 of 36) of scenarios without context, increasing slightly to 74% (26.5 of 36) with textual context. GPT-4 achieved an accuracy of 83% (30 of 36) without context and 100% (36 of 36) with textual context (P = .025). With image context, GPT-4 performed similarly to GPT-4 without context, with an accuracy of 81% (29 of 36). ChatGPT generally performed poorly when stratifying intermediate-risk NMIBC (33%-63%). When risk stratification was incorrect, most responses overestimated risk.
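As a quick arithmetic check, the reported percentages can be reproduced from the counts of correctly stratified scenarios. The sketch below is illustrative only: the dictionary labels and the `accuracy_pct` helper are hypothetical names, not part of the study, and the counts are taken directly from the results above. The abstract does not say why some counts are fractional (24.5, 26.5), so the code treats them as given.

```python
# Counts of correctly stratified scenarios (out of 36), as reported in the abstract.
# The condition labels are illustrative names, not terms defined by the authors.
TOTAL_SCENARIOS = 36

correct_counts = {
    "GPT-3.5, no context": 24.5,
    "GPT-3.5, text context": 26.5,
    "GPT-4, no context": 30,
    "GPT-4, text context": 36,
    "GPT-4, image context": 29,
}


def accuracy_pct(n_correct: float, total: int = TOTAL_SCENARIOS) -> int:
    """Return accuracy as a percentage rounded to the nearest whole number."""
    return round(100 * n_correct / total)


accuracies = {name: accuracy_pct(n) for name, n in correct_counts.items()}

for name, pct in accuracies.items():
    print(f"{name}: {pct}%")
```

The rounded values agree with the percentages quoted in the results (68%, 74%, 83%, 100%, and 81%).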
GPT-4 can accurately risk stratify patients with respect to NMIBC when provided with context containing guidelines. Overestimation of risk is more common than underestimation, and intermediate risk NMIBC is most likely to be incorrectly stratified. With further validation, GPT-4 can become a tool for risk stratification of NMIBC in clinical practice.