Cherrez-Ojeda Ivan, Faytong-Haro Marco, Alvarez-Muñoz Patricio, Larco José Ignacio, de Arruda Chaves Erika, Rojo Isabel, Moncayo Carol Vivian, Ramon German D, Rodas-Valero Gabriela, Kocatürk Emek, Mosnaim Giselle S, Robles-Velasco Karla
Universidad Espíritu Santo, Samborondon, Ecuador.
Respiralab Research Group, Guayaquil, Ecuador.
World Allergy Organ J. 2025 Jun 14;18(7):101071. doi: 10.1016/j.waojou.2025.101071. eCollection 2025 Jul.
The increasing use of artificial intelligence (AI) in healthcare, particularly for delivering medical information, raises concerns about the reliability and accuracy of AI-generated responses. This study evaluates the quality, reliability, and readability of ChatGPT-4 responses on chronic urticaria (CU) care, given the potential consequences of inaccurate medical information.
The goal of the study was to assess the quality, reliability, and readability of ChatGPT-4 responses to questions on CU management derived from international guidelines, using validated instruments to evaluate ChatGPT-4 as a resource for obtaining medical information.
Twenty-four questions were derived from the EAACI/GA²LEN/EuroGuiDerm/APAAACI guideline recommendations and used as prompts for ChatGPT-4, with each question submitted in a separate chat. The questions were grouped into 3 categories: A) Classification and Diagnosis, B) Assessment and Monitoring, and C) Treatment and Management Recommendations. Allergy specialists independently evaluated the responses using the DISCERN instrument for quality, the Journal of the American Medical Association (JAMA) benchmark criteria for reliability, and Flesch Reading Ease scores for readability. Scores were summarized as medians with interquartile ranges, and agreement between raters was assessed with intraclass correlation coefficients.
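As an illustration of the workflow described above, the following Python sketch shows how each guideline-derived question could be submitted to ChatGPT-4 in its own chat session and how readability and summary statistics might then be computed. The model name ("gpt-4"), the openai and textstat packages, and the placeholder questions are assumptions made for illustration; the abstract does not describe the authors' actual tooling.

```python
# Illustrative sketch only; not the study's actual pipeline.
# Assumes openai>=1.0 and textstat are installed and OPENAI_API_KEY is set;
# "gpt-4" stands in for whichever ChatGPT-4 version the authors queried.
import statistics

import textstat
from openai import OpenAI

client = OpenAI()

questions = [
    "How is chronic urticaria classified?",                              # hypothetical prompt
    "Which tools are recommended to monitor disease activity in CU?",    # hypothetical prompt
    "What is the recommended first-line treatment for CU?",              # hypothetical prompt
]

responses = []
for q in questions:
    # A fresh messages list per call approximates "one independent chat per question".
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": q}],
    )
    responses.append(reply.choices[0].message.content)

# Flesch Reading Ease: higher scores mean easier text; scores below ~30 are
# generally considered very difficult ("confusing") for lay readers.
flesch_scores = [textstat.flesch_reading_ease(text) for text in responses]

q1, _, q3 = statistics.quantiles(flesch_scores, n=4)
print("Median Flesch Reading Ease:", statistics.median(flesch_scores))
print("IQR:", q3 - q1)
```

Inter-rater agreement for the specialists' DISCERN and JAMA ratings could then be computed as an intraclass correlation coefficient, for example with pingouin.intraclass_corr; that step is omitted here because the individual rating data are not part of the abstract.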
According to the JAMA benchmark criteria, categories A and C showed insufficient reliability, with median scores of 1 and 0, respectively, and category B also scored low (median 2, interquartile range [IQR] 2). Information quality for category C questions was satisfactory on DISCERN (median 51.5, IQR 12.5). On the Flesch Reading Ease assessment, responses in all 3 categories fell in the "confusing" (difficult-to-read) range.
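For context on the readability finding, the standard Flesch Reading Ease formula is shown below; the abstract does not report the exact score bands the authors applied, so the interpretation given here uses the commonly cited thresholds rather than the study's own.

\[
\text{FRE} = 206.835 \;-\; 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) \;-\; 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

Scores range roughly from 0 to 100, with higher values indicating easier text; scores below about 30 correspond to very difficult, college-graduate-level material, the band often labeled "confusing" in readability studies of patient-facing health information.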
The study's limitations include its focus on CU, possible bias in question selection, reliance on specific instruments (DISCERN, the JAMA benchmark criteria, and Flesch Reading Ease), and dependence on expert opinion for the assessments.
ChatGPT-4 shows potential for producing medical content; however, its reliability is inconsistent, underscoring the need for caution and verification when using AI-generated medical information, particularly in the management of CU.