Usen Ahmet, Kuculmez Ozlem
Department of Physical Medicine and Rehabilitation, Medipol University, Istanbul 34810, Turkey.
Department of Physical Medicine and Rehabilitation, Baskent University Alanya Hospital, Antalya 07400, Turkey.
Diagnostics (Basel). 2025 Jun 7;15(12):1455. doi: 10.3390/diagnostics15121455.
Background/Objectives: Guidelines play an important role in the management of complex and chronic conditions such as axial spondyloarthropathy. The aim of this study was to compare the answers given by several large language models to open-ended questions derived from the ASAS-EULAR 2022 guideline. Methods: This was a cross-sectional, comparative study. The 15 recommendations in the ASAS-EULAR 2022 guideline were converted directly into open-ended questions set in a clinical context. Each question was posed to the ChatGPT-3.5, GPT-4o, and Gemini 2.0 Flash models. The answers were rated for usefulness and reliability on a seven-point Likert scale, assessed for readability with the Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) metrics, and assessed for semantic and surface-level similarity with the Universal Sentence Encoder (USE) and ROUGE-L. The results of the different models were compared statistically, and p < 0.05 was considered statistically significant. Results: Gemini 2.0 achieved better FKRE and FKGL scores (p < 0.05). Reliability and usefulness scores were significantly higher for ChatGPT-4o and Gemini 2.0 (p < 0.05). ChatGPT-4o yielded significantly higher semantic similarity scores than ChatGPT-3.5 (p < 0.05). FKRE and FKGL scores were negatively correlated, and reliability and usefulness scores were positively correlated (p < 0.05). Conclusions: ChatGPT-4o and Gemini 2.0 gave more reliable, useful, and readable answers to open-ended questions derived from the ASAS-EULAR 2022 guideline. These models may help support treatment decision-making in rheumatology in the future.
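The readability and surface-similarity metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state which tools were used. The function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and ROUGE-L is computed here as a word-level F1 over the longest common subsequence, which is one common variant. The two readability formulas are the standard Flesch ones. Both depend on the same two ratios (words per sentence and syllables per word) with opposite signs, which is consistent with the negative correlation reported between FKRE and FKGL.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    """Return (reading ease, grade level) for a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch Reading Ease: higher means easier to read.
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Standard Flesch-Kincaid Grade Level: higher means harder to read.
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Word-level ROUGE-L as an F1 score (beta = 1 is an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In a setup like the one described, `readability` would be applied to each model answer, and `rouge_l_f1` would compare each answer against the corresponding guideline recommendation. Semantic similarity with the Universal Sentence Encoder requires a pretrained embedding model and is left out of this sketch.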