Uzun Bektaş Aslıhan, Bora Balahan, Ünsal Erbil
Department of Pediatric Rheumatology, Dokuz Eylul University, Izmir, Turkey.
Eur J Pediatr. 2025 Jul 18;184(8):491. doi: 10.1007/s00431-025-06318-y.
Familial Mediterranean fever (FMF) is the most common monogenic autoinflammatory disease. Large language models (LLMs) offer rapid access to medical information. This study evaluated and compared the reliability, quality, and accuracy of ChatGPT-4o and LLaMA-3.1 in answering FMF-related questions. The Single Hub and Access Point for Paediatric Rheumatology in Europe (SHARE) and European League Against Rheumatism (EULAR) guidelines were used both for question generation and for answer validation. Thirty-one questions were developed from a clinician's perspective based on these guidelines. Two pediatric rheumatologists with over 20 years of FMF experience independently and blindly evaluated the responses. Reliability, quality, and accuracy were assessed using the modified DISCERN scale, the Global Quality Score, and the guidelines, respectively. Readability was assessed using multiple established indices. Statistical analyses included the Shapiro-Wilk test to assess normality, followed by paired t-tests for normally distributed scores and Wilcoxon signed-rank tests for non-normally distributed scores. Both models demonstrated moderate reliability and high response quality. In terms of alignment with the guidelines, LLaMA provided fully aligned, complete, and accurate responses to 51.6% (16/31) of the questions, whereas ChatGPT provided such responses to 80.6% (25/31). While 9.7% (4/31) of LLaMA's responses were entirely contradictory to the guidelines, ChatGPT did not produce any such responses. ChatGPT outperformed LLaMA in accuracy, quality, and reliability, and these differences were statistically significant. Readability assessments showed that responses from both LLMs required college-level reading comprehension.
While LLMs show great promise, current limitations in accuracy and guideline adherence mean they should supplement, not replace, clinical expertise.
This study does not involve clinical trials; therefore, no clinical trial registration is required.
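The statistical comparison described above (a Shapiro-Wilk normality check on the paired score differences, followed by either a paired t-test or a Wilcoxon signed-rank test) can be illustrated with a minimal sketch. The scores and variable names below are hypothetical placeholders, not the study's data:

```python
# Illustrative sketch only: per-question scores are made up for demonstration.
from scipy import stats

chatgpt_scores = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]   # hypothetical per-question scores for ChatGPT-4o
llama_scores   = [3, 4, 3, 3, 4, 3, 4, 4, 2, 3]   # hypothetical per-question scores for LLaMA-3.1

# Shapiro-Wilk test on the paired differences to assess normality
differences = [a - b for a, b in zip(chatgpt_scores, llama_scores)]
_, p_normal = stats.shapiro(differences)

if p_normal > 0.05:
    # Differences consistent with normality: use a paired t-test
    stat, p_value = stats.ttest_rel(chatgpt_scores, llama_scores)
    test_name = "paired t-test"
else:
    # Non-normal differences: use the Wilcoxon signed-rank test
    stat, p_value = stats.wilcoxon(chatgpt_scores, llama_scores)
    test_name = "Wilcoxon signed-rank test"

print(f"{test_name}: statistic={stat:.3f}, p={p_value:.4f}")
```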
•FMF is the most common hereditary autoinflammatory disease. LLMs are increasingly used to provide clinical information.
•This is the first study to assess two different LLMs in the context of FMF, evaluating their reliability, quality, and alignment with clinical guidelines. ChatGPT-4o outperformed LLaMA-3.1 in reliability, quality, and guideline alignment for FMF. However, both models showed informational gaps that may limit their clinical use.