King Ryan C, Samaan Jamil S, Yeo Yee Hui, Peng Yuxin, Kunkel David C, Habib Ali A, Ghashghaei Roxana
Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, Orange, CA, United States.
Karsh Division of Gastroenterology and Hepatology, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States.
JMIR Cardio. 2024 Apr 19;8:e53421. doi: 10.2196/53421.
Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of ensuring the availability of high-quality patient education materials to support better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising tool for delivering accurate and tailored information to patients.
We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis.
In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and inputted into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, who specialize in amyloidosis. These 2 reviewers (RG and DCK) also graded general questions, with disagreements resolved through discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by category for further analysis. Reproducibility was assessed by inputting each question twice into each model. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R software (R Foundation for Statistical Computing).
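The study scored readability with the Textstat packages. As an illustration of the kind of formula those packages compute, below is a minimal pure-Python sketch of the Flesch-Kincaid grade level; the syllable counter is a rough vowel-group heuristic for demonstration, not Textstat's actual implementation:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, dropping a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return round(0.39 * len(words) / len(sentences)
                 + 11.8 * syllables / len(words) - 15.59, 1)
```

A score of 15.5, the study's average, corresponds to a reader roughly 15-16 years into US schooling, well above the fifth- to sixth-grade target noted in the conclusions.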
ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to "general questions" (n=56). When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best in response to cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses and ChatGPT-3.5 for 8 (53%) responses. Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4's responses ranged from 10th grade to beyond graduate US grade levels, with an average of 15.5 (SD 1.9).
Large language models are a promising tool for providing accurate and reliable health information to patients living with amyloidosis. However, ChatGPT's responses exceeded the American Medical Association's recommended fifth- to sixth-grade reading level. Future studies focusing on improving response accuracy and readability are warranted. Prior to widespread implementation, the technology's limitations and ethical implications must be further explored to ensure patient safety and equitable access.