Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations.

Author information

Usen Ahmet, Kuculmez Ozlem

Affiliations

Department of Physical Medicine and Rehabilitation, Medipol University, Istanbul 34810, Turkey.

Department of Physical Medicine and Rehabilitation, Baskent University Alanya Hospital, Antalya 07400, Turkey.

Publication information

Diagnostics (Basel). 2025 Jun 7;15(12):1455. doi: 10.3390/diagnostics15121455.

DOI: 10.3390/diagnostics15121455

PMID: 40564776

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12192445/

Abstract

Background/Objectives: Guidelines are of great importance in elucidating complex, chronic conditions such as axial spondyloarthropathy. The aim of this study was to compare the answers that several large language models gave to open-ended questions created from the ASAS-EULAR 2022 guidance.

Methods: This was a cross-sectional, comparative study. Each of the 15 recommendations in the ASAS-EULAR 2022 guideline was converted directly into an open-ended question set in a clinical context. Each question was posed to the ChatGPT-3.5, GPT-4o, and Gemini 2.0 Flash models. The answers were rated for usefulness and reliability on a seven-point Likert scale, for readability with the Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) metrics, and for semantic and surface-level similarity with the Universal Sentence Encoder (USE) and ROUGE-L. The results of the different models were compared statistically, with p < 0.05 considered statistically significant.

Results: Google Gemini 2.0 obtained better FKRE and FKGL scores (p < 0.05). Reliability and usefulness scores were significantly higher for ChatGPT-4o and Gemini 2.0 (p < 0.05). ChatGPT-4o yielded significantly higher semantic similarity scores than ChatGPT-3.5 (p < 0.05). FKRE and FKGL scores were negatively correlated, and reliability and usefulness scores were positively correlated (p < 0.05).

Conclusions: ChatGPT-4o and Gemini 2.0 gave more reliable, useful, and readable answers to open-ended questions derived from the ASAS-EULAR 2022 guidelines. These models may help support treatment decision-making in rheumatology in the future.
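The abstract does not state which tools computed the readability and similarity scores. As a minimal sketch, assuming the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas and the common LCS-based F1 form of ROUGE-L over whitespace tokens, the automated metrics could be computed as below. The helper names (`count_syllables`, `flesch_scores`, `lcs_length`, `rouge_l_f1`) are hypothetical, and the vowel-group syllable counter is a crude heuristic rather than what any particular readability tool uses. USE similarity needs a pretrained sentence-encoder model and is not reproduced here.

```python
import re


def count_syllables(word: str) -> int:
    """Hypothetical heuristic: count vowel groups, dropping a silent final 'e'.
    Dedicated readability tools use dictionaries or more refined rules."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (FKRE, FKGL) using the standard Flesch / Flesch-Kincaid formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    wps = len(words) / len(sentences)  # average words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    fkre = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier to read
    fkgl = 0.39 * wps + 11.8 * spw - 15.59  # approximate US school grade
    return fkre, fkgl


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall over tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(flesch_scores("The cat sat."))
    # ≈ 0.833: the LCS has 5 tokens and both texts have 6 tokens
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because FKRE subtracts, and FKGL adds, positive multiples of the same two ratios (words per sentence and syllables per word), the two scores tend to move in opposite directions. This is consistent with the negative FKRE-FKGL correlation reported in the abstract.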
